How to get Text in a PDF Document by Page using Java

Jun 15, 2022

Exporting larger files (PowerPoint presentations, for example) to PDF makes them easier to view and print. However, once information is vectorized or rasterized in a PDF, extracting specific details like text can become tedious, especially for larger PDF files.

Thankfully, with the help of our PDF to Text API, you can easily extract the text from each page of a PDF file. The API returns the text page by page with page number labels, so the output stays organized. To take advantage of this API, follow the instructions below to structure your API call in Java.

Let’s begin by installing the client library with Maven. First, add a reference to the JitPack repository in pom.xml:

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

Next, let’s add a reference to the dependency in pom.xml:

<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>

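After that, add the imports at the top of your file:

// Import classes:
import java.io.File;
import com.cloudmersive.client.invoker.ApiClient;
import com.cloudmersive.client.invoker.ApiException;
import com.cloudmersive.client.invoker.Configuration;
import com.cloudmersive.client.invoker.auth.*;
import com.cloudmersive.client.EditPdfApi;
// Result type used below; the model package name is assumed from the client's standard layout
import com.cloudmersive.client.model.PdfTextByPageResult;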

Now you can complete your API call with the remaining code. Two inputs are required (your input file and your Cloudmersive API key); you may also specify how whitespace in the document is handled.
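A quick note on the optional whitespace setting before the call itself. The request accepts either 'preserveWhitespace' (the default) or 'minimizeWhitespace'. Below is a minimal sketch of falling back to the default when nothing, or an unrecognized value, is supplied. The `resolveFormattingMode` helper is hypothetical and not part of the Cloudmersive client.

```java
import java.util.Arrays;
import java.util.List;

public class FormattingModes {
    // The two values named in the parameter description in this article.
    public static final String PRESERVE = "preserveWhitespace";
    public static final String MINIMIZE = "minimizeWhitespace";
    private static final List<String> ALLOWED = Arrays.asList(PRESERVE, MINIMIZE);

    // Hypothetical helper (not part of the client library):
    // returns the requested mode if it is one of the two known values,
    // otherwise falls back to the documented default.
    public static String resolveFormattingMode(String requested) {
        if (requested != null && ALLOWED.contains(requested.trim())) {
            return requested.trim();
        }
        return PRESERVE;
    }

    public static void main(String[] args) {
        System.out.println(resolveFormattingMode("minimizeWhitespace"));
        System.out.println(resolveFormattingMode(null));
    }
}
```

The returned string can then be passed as the second argument of the call, which looks like this: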

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

EditPdfApi apiInstance = new EditPdfApi();
File inputFile = new File("/path/to/inputfile"); // File | Input file to perform the operation on.
// String | Optional; specify how whitespace should be handled when converting the document to text.
// 'preserveWhitespace' (default) attempts to preserve whitespace and the relative positioning of text;
// 'minimizeWhitespace' will not insert additional spaces into the document in most cases.
String textFormattingMode = "preserveWhitespace";
try {
    PdfTextByPageResult result = apiInstance.editPdfGetPdfTextByPages(inputFile, textFormattingMode);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling EditPdfApi#editPdfGetPdfTextByPages");
    e.printStackTrace();
}
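The snippet above simply prints the raw result object. If you would rather build your own page-labeled output, something like the following sketch might help. It assumes you have already copied each page's number and text out of the response into a simple stand-in class. Both `PageText` and `PageTextReport` are hypothetical names for illustration, and the client library's actual accessors are not shown here, so check them against its model classes.

```java
import java.util.Arrays;
import java.util.List;

public class PageTextReport {
    // Stand-in for one page of extracted text. Hypothetical type for
    // illustration only; it is not the client library's model class.
    public static final class PageText {
        final int pageNumber;
        final String text;

        public PageText(int pageNumber, String text) {
            this.pageNumber = pageNumber;
            this.text = text;
        }
    }

    // Builds one labeled block per page, skipping pages with no text.
    public static String format(List<PageText> pages) {
        StringBuilder out = new StringBuilder();
        for (PageText page : pages) {
            if (page.text == null || page.text.trim().isEmpty()) {
                continue; // nothing to report for blank pages
            }
            out.append("--- Page ").append(page.pageNumber).append(" ---\n");
            out.append(page.text.trim()).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<PageText> pages = Arrays.asList(
                new PageText(1, "Quarterly summary"),
                new PageText(2, "   "),
                new PageText(3, "Appendix"));
        System.out.print(format(pages));
    }
}
```

Keeping the labeling in a separate step like this means the same formatting can be reused whether the text comes from the API or from a saved copy of an earlier response.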

All done. First time using one of our APIs? To get your API key, head to our website and register a free account.


Written by Cloudmersive
