How to Get Text from a PDF by Page in Power Automate

Cloudmersive
4 min read · Oct 24, 2024

Text-based PDFs (vector PDFs, for example) store their text as a series of objects within the file structure. That means we can extract text from text-based PDFs programmatically without using Optical Character Recognition (OCR).

Thankfully, there’s an easy way to retrieve text from PDFs — and it doesn’t involve writing any code. We can simply use the Cloudmersive PDF connector in Power Automate to extract text from PDFs by page; this returns an array of page numbers and text contents from each respective page in our original PDF document.

In this article, we’ll walk through a quick instant cloud flow that demonstrates how to get PDF text by page using the Cloudmersive PDF connector.

We’ll first add a Get file content action to grab a PDF file from our folder system. In my example flow, I’ll be getting text from a four-page PDF document containing Lorem Ipsum text content.

Next up, we’ll add our Cloudmersive PDF connector action. To find the Cloudmersive PDF connector, we’ll type “Cloudmersive” into the search bar, and we’ll then locate the correct connector option from the list below.

To get to the actions list, we’ll click “See more”, and from there we’ll search for an action called Get text in a PDF document by page. If there are other PDF workflows we’re looking to automate, we can take a moment to review some of the other actions on this list.

When we select this action, we’ll need to create and authorize our Cloudmersive connection before we fill in any request parameters. As long as we have a premium Power Automate license, we can use Cloudmersive connectors at no cost with a free API key. A free API key allows up to 800 API calls per month with no commitments (our total will reset each month).

Structuring our request is easy: we’ll simply add our PDF file bytes and a file name into each respective parameter.

This action returns a Pages array in which each entry pairs a page number with the text from that page of our PDF. We can see this in the response model below.

{
  "Successful": true,
  "Pages": [
    {
      "PageNumber": 0,
      "PageText": "string"
    }
  ]
}

That means we can use the Apply to each action from the Control connector to generate multiple new text files, or we can slice up our array using one of the many Collection functions Power Automate provides for us.
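For readers who think more easily in code, here is a minimal Python sketch of the same two ideas outside of Power Automate: looping over the Pages array (the equivalent of Apply to each) and slicing it (comparable to Collection functions such as take and skip). The sample page text, the page numbers starting at 1, the `pages_to_files` helper, and the file naming pattern are all assumptions made for illustration; they are not part of the connector.

```python
import json

# Sample response shaped like the response model above.
# The page text and the 1-based page numbers are made up for illustration.
sample_response = json.loads("""
{
  "Successful": true,
  "Pages": [
    {"PageNumber": 1, "PageText": "Lorem ipsum dolor sit amet."},
    {"PageNumber": 2, "PageText": "Consectetur adipiscing elit."},
    {"PageNumber": 3, "PageText": "Sed do eiusmod tempor."},
    {"PageNumber": 4, "PageText": "Ut labore et dolore magna."}
  ]
}
""")


def pages_to_files(response, base_name="document"):
    """Map a hypothetical text file name to the text of each page."""
    if not response.get("Successful"):
        raise ValueError("text extraction was not successful")
    return {
        f"{base_name}_page_{page['PageNumber']}.txt": page["PageText"]
        for page in response["Pages"]
    }


# One entry per page, similar to creating a file inside Apply to each.
files = pages_to_files(sample_response)

# Slicing the array, comparable to take(..., 2) and skip(..., 2).
first_two = sample_response["Pages"][:2]
rest = sample_response["Pages"][2:]
```

In a real flow, the loop body would be a file creation action rather than a dictionary entry, but the shape of the data we iterate over is the same.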

It’s also worth noting that the text content we’re retrieving is raw: it lacks the visual structure it had in our original PDF document display. As such, we shouldn’t expect this text to look pretty; rather, we should expect something functional containing line and paragraph breaks. We’ll notice this if we test our flow and review the raw outputs.
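If we plan to pass this raw text to a downstream step, a little cleanup can help. The Python sketch below shows one possible approach: it normalizes line endings, treats blank lines as paragraph boundaries, and joins the remaining line breaks into spaces. The `split_paragraphs` helper and the blank-line convention are assumptions for illustration; the actual break characters in a given PDF's output may differ, so it's worth checking the raw outputs first.

```python
import re


def split_paragraphs(page_text):
    """Split raw page text into paragraphs, joining lines within each one."""
    # Normalize Windows and old Mac line endings to "\n".
    text = page_text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines mark a paragraph boundary.
    blocks = re.split(r"\n\s*\n", text)
    # Collapse remaining line breaks and runs of whitespace into single spaces.
    paragraphs = [re.sub(r"\s+", " ", block).strip() for block in blocks]
    # Drop any empty fragments left over from leading or trailing breaks.
    return [p for p in paragraphs if p]


# Made-up sample text with hard line breaks and one paragraph break.
raw = "Lorem ipsum dolor\r\nsit amet.\r\n\r\nConsectetur   adipiscing\r\nelit."
paragraphs = split_paragraphs(raw)
```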

Retrieving text in this way is a great precursor to performing sentiment analysis, or any other Natural Language Processing (NLP) classification. Additionally, if our PDF contains raw data on one page, this is a great way to get that raw data directly without first converting our PDF to another format.
