How to Identify Fake PDFs using an API in Node.js

Cloudmersive
5 min read · May 24, 2024

Creating malicious, fake PDFs with HTML and JavaScript is easier than we might think.

Sophisticated threat actors can create extremely convincing fake PDFs from scratch, and these fake documents often carry a valid .pdf extension, so they open in much the same way a real PDF would.

If we open a fake PDF in our web browser (or a vulnerable PDF processing application), we might unwittingly execute an attacker’s code. This can, for instance, create a remote connection with the attacker’s server, allowing them to download malicious content onto our device.

One way we can identify fake PDFs is by applying deterministic (rules-based) threat detection policies. In this case, deterministic threat detection means verifying that the file's contents rigorously conform to PDF file formatting standards.

Carrying through our prior example, the attacker’s fake PDF might present an extension that appears valid, but a deterministic scan will identify that the contents of the fake PDF don’t actually meet stringent PDF formatting standards.
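To make the idea of a rules-based check concrete, here is a minimal sketch of one such rule: a well-formed PDF begins with a `%PDF-` header and ends with an `%%EOF` marker. The `looksLikePdf` helper is a hypothetical name introduced for illustration. It is far cruder than the full content verification the API performs and is not a substitute for it.

```javascript
// Hypothetical helper: a crude structural check, NOT the API's actual logic.
// It only tests for the "%PDF-" header and an "%%EOF" marker near the end.
function looksLikePdf(buffer) {
  const header = buffer.subarray(0, 5).toString('latin1');
  const tailStart = Math.max(0, buffer.length - 1024);
  const tail = buffer.subarray(tailStart).toString('latin1');
  return header === '%PDF-' && tail.includes('%%EOF');
}

// An HTML payload saved with a .pdf extension fails the check.
const fakePdf = Buffer.from('<html><script>alert("hi")</script></html>');
// A skeletal file with the expected header and end marker passes it.
const skeletalPdf = Buffer.from('%PDF-1.7\n1 0 obj\n<<>>\nendobj\n%%EOF\n');

console.log(looksLikePdf(fakePdf));     // false
console.log(looksLikePdf(skeletalPdf)); // true
```

A check this simple is easy for an attacker to satisfy, which is why a real deterministic scan validates far more of the file's structure.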

Using the code below, we can take advantage of a free API in our Node.js file upload forms that deterministically checks file uploads for content threats and simultaneously performs a signature-based virus and malware scan.

The deterministic scan includes checking for invalid files (like a fake PDF), and it also includes identifying malicious content types such as macros, executables, unsafe archives (e.g., zip bombs), password-protected files (commonly used to disguise malicious code), and more.

To authorize our API calls, we’ll need a free API key. This will allow us to make up to 800 API calls per month with zero commitments (perfect for smaller-scale projects).
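Since the API key is a secret, we may prefer to keep it out of our source code. The sketch below reads it from an environment variable. Both the `getApiKey` helper and the `CLOUDMERSIVE_API_KEY` variable name are assumptions made for this example and are not defined by the SDK.

```javascript
// Hypothetical helper: CLOUDMERSIVE_API_KEY is an assumed variable name,
// not something the SDK reads on its own.
function getApiKey(env = process.env) {
  const key = (env.CLOUDMERSIVE_API_KEY || '').trim();
  if (!key) {
    throw new Error('CLOUDMERSIVE_API_KEY is not set');
  }
  return key;
}

// Demonstration with an explicit object standing in for process.env.
console.log(getApiKey({ CLOUDMERSIVE_API_KEY: 'demo-key' })); // demo-key
```

In a real application we would call `getApiKey()` with no arguments and assign the result wherever the API key string is expected.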

Our first step is to install the client SDK. We can run the following NPM command:

npm install cloudmersive-virus-api-client --save

Or we can add the Node client to our package.json:

"dependencies": {
  "cloudmersive-virus-api-client": "^1.1.9"
}

Next, we can use the code below to structure our function call. We can replace the placeholder 'YOUR API KEY' string with our own API key string:

var fs = require('fs'); // Needed to read the input file from disk
var CloudmersiveVirusApiClient = require('cloudmersive-virus-api-client');
var defaultClient = CloudmersiveVirusApiClient.ApiClient.instance;

// Configure API key authorization: Apikey
var Apikey = defaultClient.authentications['Apikey'];
Apikey.apiKey = 'YOUR API KEY';

var apiInstance = new CloudmersiveVirusApiClient.ScanApi();

// readFileSync already returns a Buffer, so no extra conversion is needed.
var inputFile = fs.readFileSync("C:\\temp\\inputfile"); // File | Input file to perform the operation on.

// Every allow* option is set to false (the recommended default) so that each threat type is blocked.
var opts = {
  'allowExecutables': false, // Boolean | Set to false to block executable files (program code) from being allowed in the input file. Default is false (recommended).
  'allowInvalidFiles': false, // Boolean | Set to false to block invalid files, such as a PDF file that is not really a valid PDF file, or a Word Document that is not a valid Word Document. Default is false (recommended).
  'allowScripts': false, // Boolean | Set to false to block script files, such as PHP files, Python scripts, and other malicious content or security threats that can be embedded in the file. Set to true to allow these file types. Default is false (recommended).
  'allowPasswordProtectedFiles': false, // Boolean | Set to false to block password protected and encrypted files, such as encrypted zip and rar files, and other files that seek to circumvent scanning through passwords. Set to true to allow these file types. Default is false (recommended).
  'allowMacros': false, // Boolean | Set to false to block macros and other threats embedded in document files, such as Word, Excel and PowerPoint embedded Macros, and other files that contain embedded content threats. Set to true to allow these file types. Default is false (recommended).
  'allowXmlExternalEntities': false, // Boolean | Set to false to block XML External Entities and other threats embedded in XML files, and other files that contain embedded content threats. Set to true to allow these file types. Default is false (recommended).
  'allowInsecureDeserialization': false, // Boolean | Set to false to block Insecure Deserialization and other threats embedded in JSON and other object serialization files, and other files that contain embedded content threats. Set to true to allow these file types. Default is false (recommended).
  'allowHtml': false, // Boolean | Set to false to block HTML input in the top level file; HTML can contain XSS, scripts, local file accesses and other threats. Set to true to allow these file types. Default is false (recommended) [for API keys created prior to the release of this feature default is true for backward compatibility].
  'restrictFileTypes': '' // String | Specify a restricted set of file formats to allow as clean as a comma-separated list of file formats, such as .pdf,.docx,.png would allow only PDF, PNG and Word document files. All files must pass content verification against this list of file formats, if they do not, then the result will be returned as CleanResult=false. Set restrictFileTypes parameter to null or empty string to disable; default is disabled.
};

var callback = function(error, data, response) {
  if (error) {
    console.error(error);
  } else {
    // Serialize the result so the fields print instead of "[object Object]".
    console.log('API called successfully. Returned data: ' + JSON.stringify(data, null, 2));
  }
};
apiInstance.scanFileAdvanced(inputFile, opts, callback);

Note that we can set custom threat rules to allow certain content types if we don’t want to categorically block them. We can also restrict file upload types by providing a comma-separated list of acceptable file extensions in the ‘restrictFileTypes’ parameter (e.g., ‘.pdf,.docx,.jpg’).
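As a sketch of how these custom rules might be organized for a PDF-only upload form, the hypothetical `buildPdfOnlyOptions` helper below starts from the strictest settings and lets us loosen individual rules where needed. The option names are the request parameters described above. The helper itself is an assumption for illustration and is not part of the SDK.

```javascript
// Hypothetical helper (not part of the SDK): builds an options object that
// blocks every threat type by default and accepts only PDF uploads.
function buildPdfOnlyOptions(overrides = {}) {
  const strictDefaults = {
    allowExecutables: false,
    allowInvalidFiles: false,
    allowScripts: false,
    allowPasswordProtectedFiles: false,
    allowMacros: false,
    allowXmlExternalEntities: false,
    allowInsecureDeserialization: false,
    allowHtml: false,
    restrictFileTypes: '.pdf'
  };
  // Later keys win, so overrides relax only the rules we name explicitly.
  return { ...strictDefaults, ...overrides };
}

// Example: permit password-protected PDFs but keep every other rule strict.
const uploadOpts = buildPdfOnlyOptions({ allowPasswordProtectedFiles: true });
console.log(uploadOpts.allowPasswordProtectedFiles); // true
console.log(uploadOpts.allowInvalidFiles);           // false
console.log(uploadOpts.restrictFileTypes);           // .pdf
```

Starting from a deny-everything baseline means a forgotten option stays blocked rather than silently allowed.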

The response object below is an example taken from scanning an inert JavaScript PDF test file (this test file would simply display “you’ve been hacked!” if opened in a web browser):

{
  "CleanResult": false,
  "ContainsExecutable": false,
  "ContainsInvalidFile": true,
  "ContainsScript": false,
  "ContainsPasswordProtectedFile": false,
  "ContainsRestrictedFileFormat": false,
  "ContainsMacros": false,
  "ContainsXmlExternalEntities": false,
  "ContainsInsecureDeserialization": false,
  "ContainsHtml": false,
  "ContainsUnsafeArchive": false,
  "ContainsOleEmbeddedObject": false,
  "VerifiedFileFormat": ".pdf",
  "FoundViruses": null,
  "ContentInformation": {
    "ContainsJSON": false,
    "ContainsXML": false,
    "ContainsImage": false,
    "RelevantSubfileName": null
  }
}

Here, “CleanResult”: false indicates that a threat was identified, and “ContainsInvalidFile”: true tells us which type of threat triggered that result. “VerifiedFileFormat”: “.pdf” indicates the invalid file carried a recognized extension, meaning our PDF rendering applications would have attempted to open the file like any other PDF.
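In an upload handler, we will usually want to turn this response into an accept-or-reject decision along with a reason. The sketch below shows one possible way to do that. The `summarizeScan` helper is a hypothetical name, and it assumes the response has the shape shown above, with top-level boolean fields prefixed with "Contains".

```javascript
// Hypothetical helper: assumes the response shape shown in the example above.
function summarizeScan(scanResult) {
  // Collect the top-level "Contains..." flags that came back as true.
  const flags = Object.keys(scanResult).filter(
    (key) => key.startsWith('Contains') && scanResult[key] === true
  );
  return {
    clean: scanResult.CleanResult === true,
    flags: flags,
    // FoundViruses is null when no signature matched, so fall back to [].
    virusCount: (scanResult.FoundViruses || []).length
  };
}

// Abbreviated copy of the example response from the fake PDF scan.
const exampleResponse = {
  CleanResult: false,
  ContainsExecutable: false,
  ContainsInvalidFile: true,
  ContainsScript: false,
  VerifiedFileFormat: '.pdf',
  FoundViruses: null,
  ContentInformation: { ContainsJSON: false, ContainsXML: false }
};

const summary = summarizeScan(exampleResponse);
console.log(summary.clean);      // false
console.log(summary.flags);      // [ 'ContainsInvalidFile' ]
console.log(summary.virusCount); // 0
```

With a summary like this, an upload form could reject any file where `clean` is false and log `flags` to explain why.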

Now we can easily identify fake PDFs and other hidden threats in our Node.js file upload forms.
