How to Convert PDF to Text in C# .NET Framework
Non-rasterized PDFs often contain text objects we can extract from the document without OCR and use for other purposes. Unfortunately, writing code to pull this off in C# .NET can be a bit of a hassle.
Thankfully, using the code examples provided below, we can take advantage of a free API to handle this text extraction for us. In our request, we can even specify how whitespace should be handled in the conversion: ‘preserveWhitespace’ (the default) attempts to keep whitespace and the relative positioning of text, while ‘minimizeWhitespace’ returns text without inserting additional spaces in most cases.
We can set up our API call in two quick steps.
First, let’s install the SDK. We can install via NuGet by running the following command in our Package Manager console:
Install-Package Cloudmersive.APIClient.NET.DocumentAndDataConvert -Version 3.4.2
Next, let’s call the function using the ready-to-run example below. We’ll need to authorize our API calls with a free-tier API key in place of the YOUR_API_KEY placeholder, which will allow us to make up to 800 API calls per month (with no additional commitments):
using System;
using System.Diagnostics;
using Cloudmersive.APIClient.NET.DocumentAndDataConvert.Api;
using Cloudmersive.APIClient.NET.DocumentAndDataConvert.Client;
using Cloudmersive.APIClient.NET.DocumentAndDataConvert.Model;

namespace Example
{
    public class ConvertDocumentPdfToTxtExample
    {
        public void main()
        {
            // Configure API key authorization: Apikey
            Configuration.Default.AddApiKey("Apikey", "YOUR_API_KEY");

            var apiInstance = new ConvertDocumentApi();

            // string | Optional; how whitespace is handled when converting PDF to text.
            // 'preserveWhitespace' (default) attempts to preserve whitespace and the relative
            // positioning of text; 'minimizeWhitespace' will not insert additional spaces in most cases.
            var textFormattingMode = "preserveWhitespace";

            // System.IO.Stream | Input file to perform the operation on.
            // The using block closes the file once the call completes or fails.
            using (var inputFile = new System.IO.FileStream("C:\\temp\\inputfile", System.IO.FileMode.Open))
            {
                try
                {
                    // Convert PDF Document to Text (txt)
                    TextConversionResult result = apiInstance.ConvertDocumentPdfToTxt(inputFile, textFormattingMode);
                    Debug.WriteLine(result);
                }
                catch (Exception e)
                {
                    Debug.Print("Exception when calling ConvertDocumentApi.ConvertDocumentPdfToTxt: " + e.Message);
                }
            }
        }
    }
}
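If the whitespace mode comes from user input or configuration, it may help to check it against the two documented values before sending the request, and to save the extracted text once it comes back. The sketch below is one way to do that. `PdfTextHelpers`, `NormalizeFormattingMode`, and `SaveTextBesidePdf` are hypothetical names made up for illustration and are not part of the SDK; the fallback to ‘preserveWhitespace’ simply mirrors the documented default.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helpers for illustration; none of these names come from the SDK.
public static class PdfTextHelpers
{
    // The two whitespace modes described in this article.
    public const string PreserveWhitespace = "preserveWhitespace";
    public const string MinimizeWhitespace = "minimizeWhitespace";

    // Returns one of the two documented mode strings. Anything that is not
    // 'minimizeWhitespace' (compared case-insensitively), including null or
    // an empty string, falls back to the documented default.
    public static string NormalizeFormattingMode(string mode)
    {
        if (string.Equals(mode, MinimizeWhitespace, StringComparison.OrdinalIgnoreCase))
        {
            return MinimizeWhitespace;
        }
        return PreserveWhitespace;
    }

    // Writes the extracted text to a .txt file alongside the source PDF
    // (same folder and base name) and returns the path that was written.
    public static string SaveTextBesidePdf(string pdfPath, string text)
    {
        string outputPath = Path.ChangeExtension(pdfPath, ".txt");
        File.WriteAllText(outputPath, text ?? string.Empty, Encoding.UTF8);
        return outputPath;
    }
}
```

With these in place, the second argument to ConvertDocumentPdfToTxt could be PdfTextHelpers.NormalizeFormattingMode(userChoice), where userChoice stands in for whatever value our application collects. The extracted string could then be handed to SaveTextBesidePdf; this assumes the result object exposes the text as a string property (believed to be TextResult, but worth confirming against the SDK's TextConversionResult model).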
Just like that, we’re all done. Now we can quickly and easily pull plain text from non-rasterized PDF documents without using OCR in the process.