How to Extract Text from a PDF by Page in Python
There are two common ways to extract plain text from a PDF file: using OCR (optical character recognition) to read the PDF as an image, or using a text extraction API to pull plain text directly from the vector file's contents. The API solution below uses the latter method, programmatically extracting text from your PDF files and returning it grouped by the page it came from. You can use this API for free by copying and pasting the ready-to-run Python code examples provided below.
The first step is to install the Python SDK. We can do so by running the following command:
pip install cloudmersive-convert-api-client
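Before moving on, it can help to confirm that the install succeeded in the interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name is_installed is our own illustration and not part of the SDK.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the current interpreter."""
    # find_spec returns None when a top-level module is not importable.
    return importlib.util.find_spec(module_name) is not None


# The SDK's import name, as used in the import statements of this article.
SDK_MODULE = "cloudmersive_convert_api_client"

if is_installed(SDK_MODULE):
    print("SDK found; ready to continue.")
else:
    print("SDK not found; re-run the pip install command above.")
```

If the check reports that the SDK is missing even after installing, the most likely cause is that pip installed the package into a different Python environment than the one running the script.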
Next, we can copy in the remaining code, beginning with the imports:
from __future__ import print_function
import time
import cloudmersive_convert_api_client
from cloudmersive_convert_api_client.rest import ApiException
from pprint import pprint
# Configure API key authorization: Apikey
configuration = cloudmersive_convert_api_client.Configuration()
configuration.api_key['Apikey'] = 'YOUR_API_KEY'
# create an instance of the API class
api_instance = cloudmersive_convert_api_client.EditPdfApi(cloudmersive_convert_api_client.ApiClient(configuration))
input_file = '/path/to/inputfile' # file | Input file to perform the operation on.
# str | Optional; specifies how whitespace is handled when converting the document to text.
# 'preserveWhitespace' (the default) attempts to preserve whitespace and the relative positioning of text within the document.
# 'minimizeWhitespace' will not insert additional spaces into the document in most cases.
text_formatting_mode = 'preserveWhitespace'
try:
    # Get text in a PDF document by page
    api_response = api_instance.edit_pdf_get_pdf_text_by_pages(input_file, text_formatting_mode=text_formatting_mode)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling EditPdfApi->edit_pdf_get_pdf_text_by_pages: %s\n" % e)
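Printing the raw response is enough for a quick check, but in practice you will usually want the text keyed by page. The sketch below assumes the response object exposes a pages list whose items carry page_number and text attributes; verify those attribute names against the SDK's model documentation before relying on them. The helper pages_to_dict and the stand-in response are hypothetical and only illustrate the shape, so the snippet runs without an API call.

```python
from types import SimpleNamespace


def pages_to_dict(api_response):
    """Map page number -> extracted text.

    Assumes api_response.pages is a list of objects with
    page_number and text attributes (an assumption, not confirmed here).
    """
    pages = getattr(api_response, "pages", None) or []
    return {page.page_number: page.text for page in pages}


# Stand-in for a real API response, used only to show the expected shape.
fake_response = SimpleNamespace(
    successful=True,
    pages=[
        SimpleNamespace(page_number=1, text="Hello"),
        SimpleNamespace(page_number=2, text="World"),
    ],
)

text_by_page = pages_to_dict(fake_response)
for number in sorted(text_by_page):
    print(f"--- Page {number} ---")
    print(text_by_page[number])
```

With a real call, you would pass api_response to pages_to_dict in place of fake_response.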
In the code example above, you can specify how whitespace should be handled by setting text_formatting_mode to one of the values described in the code comments: 'preserveWhitespace' or 'minimizeWhitespace'.
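Because the mode is passed as a plain string, a typo would only surface once the request is sent. One way to catch that earlier is a small local check such as the sketch below. The helper normalize_formatting_mode is our own illustration, not part of the SDK; the two accepted values and the default are taken from the code comments above.

```python
# The two values documented in the code comments above.
VALID_MODES = ("preserveWhitespace", "minimizeWhitespace")


def normalize_formatting_mode(mode=None):
    """Return a valid whitespace mode, falling back to the documented default."""
    if mode is None:
        return "preserveWhitespace"
    if mode not in VALID_MODES:
        raise ValueError(
            "text_formatting_mode must be one of %s, got %r" % (VALID_MODES, mode)
        )
    return mode


text_formatting_mode = normalize_formatting_mode("minimizeWhitespace")
print(text_formatting_mode)
```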
The last step is to replace 'YOUR_API_KEY' with a valid Cloudmersive API key, which you can get for free by registering an account on our website.
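If you would rather not paste the key directly into your script, one option is to read it from an environment variable. This is a minimal sketch; the variable name CLOUDMERSIVE_API_KEY and the helper load_api_key are our own choices for illustration and are not defined by the SDK.

```python
import os


def load_api_key(env_var="CLOUDMERSIVE_API_KEY"):
    """Read the API key from an environment variable (hypothetical name)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError("Set the %s environment variable first." % env_var)
    return key


# Usage with the configuration object from the example above:
# configuration.api_key['Apikey'] = load_api_key()
```

Keeping the key out of the source file makes it harder to commit it to version control by accident.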