How to Convert PDF to Text with OCR in PHP

Cloudmersive
3 min readApr 30, 2024

Converting image-based PDF documents to text requires Optical Character Recognition (OCR).

Using the ready-to-run PHP code examples provided below, we can take advantage of a free OCR API to perform high fidelity PDF to Text conversions. We’ll just need a free Cloudmersive API key to authorize our requests, and we’ll be able to make up to 800 API calls per month with no additional commitments.

We can structure our API call in two quick steps. First, let’s install the PHP client by executing the following command from our command line:

composer require cloudmersive/cloudmersive_ocr_api_client

Next, let’s copy the below code to call the function:

<?php
require_once(__DIR__ . '/vendor/autoload.php');

// Configure API key authorization: Apikey
$config = Swagger\Client\Configuration::getDefaultConfiguration()->setApiKey('Apikey', 'YOUR_API_KEY');



$apiInstance = new Swagger\Client\Api\PdfOcrApi(


new GuzzleHttp\Client(),
$config
);
$image_file = "/path/to/inputfile"; // \SplFileObject | PDF file to perform OCR on.
$recognition_mode = "recognition_mode_example"; // string | Optional; possible values are 'Basic' which provides basic recognition and is not resillient to page rotation, skew or low quality images uses 1-2 API calls per page; 'Normal' which provides highly fault tolerant OCR recognition uses 14-16 API calls per page; and 'Advanced' which provides the highest quality and most fault-tolerant recognition uses 28-30 API calls per page. Default recognition mode is 'Basic'
$language = "language_example"; // string | Optional, language of the input document, default is English (ENG). Possible values are ENG (English), ARA (Arabic), ZHO (Chinese - Simplified), ZHO-HANT (Chinese - Traditional), ASM (Assamese), AFR (Afrikaans), AMH (Amharic), AZE (Azerbaijani), AZE-CYRL (Azerbaijani - Cyrillic), BEL (Belarusian), BEN (Bengali), BOD (Tibetan), BOS (Bosnian), BUL (Bulgarian), CAT (Catalan; Valencian), CEB (Cebuano), CES (Czech), CHR (Cherokee), CYM (Welsh), DAN (Danish), DEU (German), DZO (Dzongkha), ELL (Greek), ENM (Archaic/Middle English), EPO (Esperanto), EST (Estonian), EUS (Basque), FAS (Persian), FIN (Finnish), FRA (French), FRK (Frankish), FRM (Middle-French), GLE (Irish), GLG (Galician), GRC (Ancient Greek), HAT (Hatian), HEB (Hebrew), HIN (Hindi), HRV (Croatian), HUN (Hungarian), IKU (Inuktitut), IND (Indonesian), ISL (Icelandic), ITA (Italian), ITA-OLD (Old - Italian), JAV (Javanese), JPN (Japanese), KAN (Kannada), KAT (Georgian), KAT-OLD (Old-Georgian), KAZ (Kazakh), KHM (Central Khmer), KIR (Kirghiz), KOR (Korean), KUR (Kurdish), LAO (Lao), LAT (Latin), LAV (Latvian), LIT (Lithuanian), MAL (Malayalam), MAR (Marathi), MKD (Macedonian), MLT (Maltese), MSA (Malay), MYA (Burmese), NEP (Nepali), NLD (Dutch), NOR (Norwegian), ORI (Oriya), PAN (Panjabi), POL (Polish), POR (Portuguese), PUS (Pushto), RON (Romanian), RUS (Russian), SAN (Sanskrit), SIN (Sinhala), SLK (Slovak), SLV (Slovenian), SPA (Spanish), SPA-OLD (Old Spanish), SQI (Albanian), SRP (Serbian), SRP-LAT (Latin Serbian), SWA (Swahili), SWE (Swedish), SYR (Syriac), TAM (Tamil), TEL (Telugu), TGK (Tajik), TGL (Tagalog), THA (Thai), TIR (Tigrinya), TUR (Turkish), UIG (Uighur), UKR (Ukrainian), URD (Urdu), UZB (Uzbek), UZB-CYR (Cyrillic Uzbek), VIE (Vietnamese), YID (Yiddish)
$preprocessing = "preprocessing_example"; // string | Optional, preprocessing mode, default is 'Auto'. Possible values are None (no preprocessing of the image), and Auto (automatic image enhancement of the image before OCR is applied; this is recommended).

try {
$result = $apiInstance->pdfOcrPost($image_file, $recognition_mode, $language, $preprocessing);
print_r($result);
} catch (Exception $e) {
echo 'Exception when calling PdfOcrApi->pdfOcrPost: ', $e->getMessage(), PHP_EOL;
}
?>

We can now customize various aspects of our request to handle image skew, the language of text in the document (a variety of common languages are available to select from), and the overall image quality. It’s important to note that using $recognition_mode to improve resiliency to page rotation will increase the number of API calls used per operation.

Now we can easily convert image-based PDFs to text in smaller scale projects with just a few lines of PHP code.

--

--

Cloudmersive

There’s an API for that. Cloudmersive is a leader in Highly Scalable Cloud APIs.