How to Get Text from a PDF in Python

Cloudmersive
2 min read · Mar 21, 2024

We can extract text from vector PDF documents page by page by calling a free API from Python.

Our first step is to install the client SDK with pip:

pip install cloudmersive-convert-api-client

After that, we can grab a free Cloudmersive API key to authorize our requests (the free tier allows up to 800 API calls per month with no additional commitments). We’ll copy this key into configuration.api_key['Apikey'] in our next step.
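
Rather than pasting the key directly into the script, one option is to read it from an environment variable. The following is a minimal sketch of that idea; the variable name CLOUDMERSIVE_API_KEY and the load_api_key helper are assumptions made for illustration and are not part of the SDK.

```python
import os

# Hypothetical variable name chosen for this sketch; it is not defined by the SDK.
API_KEY_ENV_VAR = "CLOUDMERSIVE_API_KEY"


def load_api_key(env=None):
    """Return the API key from the environment, or raise if it is missing."""
    # Accept any mapping so the helper can be exercised without touching os.environ.
    env = os.environ if env is None else env
    key = env.get(API_KEY_ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"Set {API_KEY_ENV_VAR} before running the script")
    return key
```

With this in place, the configuration step could read configuration.api_key['Apikey'] = load_api_key() instead of using a hard-coded string, which keeps the key out of source control.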

Our final step is to add the imports and call the function using the ready-to-run Python code below (we set the input_file variable to the path of the PDF we want to convert):

import cloudmersive_convert_api_client
from cloudmersive_convert_api_client.rest import ApiException
from pprint import pprint

# Configure API key authorization: Apikey
configuration = cloudmersive_convert_api_client.Configuration()
configuration.api_key['Apikey'] = 'YOUR_API_KEY'

# Create an instance of the API class
api_instance = cloudmersive_convert_api_client.EditPdfApi(cloudmersive_convert_api_client.ApiClient(configuration))

# Path to the input PDF file to perform the operation on
input_file = '/path/to/inputfile'

# Optional; specifies how whitespace is handled when converting the document to text.
# 'preserveWhitespace' (the default) attempts to preserve whitespace and the relative positioning of text;
# 'minimizeWhitespace' will not insert additional spaces into the document in most cases.
text_formatting_mode = 'preserveWhitespace'

try:
    # Get text in a PDF document by page
    api_response = api_instance.edit_pdf_get_pdf_text_by_pages(input_file, text_formatting_mode=text_formatting_mode)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling EditPdfApi->edit_pdf_get_pdf_text_by_pages: %s\n" % e)

That’s all there is to it. This operation extracts the text from each page of our PDF document and returns it grouped by the page it came from.
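
Once the response comes back, we will usually want to work with the text page by page rather than just pretty-printing it. The sketch below shows one way to do that. It assumes the response object exposes a pages list whose items have page_number and text attributes; those attribute names are an assumption here and should be checked against the SDK's model classes. The text_by_page helper and the stand-in fake_response object are hypothetical and exist only so the example runs without calling the API.

```python
from types import SimpleNamespace


def text_by_page(result):
    """Map page number -> extracted text for a by-page result object.

    Assumes `result.pages` is a list of objects with `page_number` and `text`
    attributes; verify these names against the SDK's model classes.
    """
    pages = getattr(result, "pages", None) or []
    # Treat a missing text value as an empty string so callers always get str.
    return {page.page_number: (page.text or "") for page in pages}


# Stand-in object so the sketch runs without calling the API.
fake_response = SimpleNamespace(
    successful=True,
    pages=[
        SimpleNamespace(page_number=1, text="First page text"),
        SimpleNamespace(page_number=2, text=None),
    ],
)

for number, text in sorted(text_by_page(fake_response).items()):
    print(f"--- Page {number} ---")
    print(text)
```

In the real script, we would pass api_response to text_by_page in place of fake_response.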
