How to Investigate .XML Files for Threats in Java

4 min read · Apr 29, 2025

XML files are harmless — they’re just structured text, right?

Hopefully nobody believes that! The reality is that XML uploads pose a major threat to any file upload workflow — especially when classic XML threats like external entities, schema abuse, or excessive nesting come into play. These real, everyday risks can lead to the disastrous outcomes we so often read about in security magazines — XML External Entity (XXE) attacks, denial-of-service conditions, or even data exfiltration.

Approaching XML Uploads the Right Way

If you can’t get around accepting .xml uploads in your Java application (whether you’re building a data import tool, a configuration pipeline, or a multiformat document exchange workflow), it becomes extremely important to vet .xml file contents beyond schema validation. In this scenario, you can’t just say “no” to XML files, and you can’t just wave them through either. Unsafe constructs or malformed content can easily slip through the cracks of your upload workflow if you’re not explicitly scanning for them.
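To make "beyond schema validation" concrete, here is a minimal local pre-check, offered as a sketch rather than a replacement for the scanning API discussed in this article. It uses the JDK's built-in parser with the `disallow-doctype-decl` feature, so any document that declares a DOCTYPE (the usual vehicle for external entities) is rejected before entities are resolved. The class and method names (`XmlPreCheck`, `isWellFormedWithoutDoctype`) are hypothetical, and rejecting every DOCTYPE is an assumption that may be too strict for workflows that rely on DTDs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: a local sanity check that runs before any remote scan.
class XmlPreCheck {

    // Returns true only if the XML is well formed and declares no DOCTYPE.
    static boolean isWellFormedWithoutDoctype(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Fail the parse as soon as a DOCTYPE declaration is encountered.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            factory.setXIncludeAware(false);
            factory.setExpandEntityReferences(false);

            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Malformed XML or a forbidden DOCTYPE both end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        String clean = "<order><item>1</item></order>";
        String xxe = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE foo [<!ENTITY xxe SYSTEM \"file:///etc/passwd\">]>"
                + "<foo>&xxe;</foo>";
        System.out.println("clean passes: " + isWellFormedWithoutDoctype(clean));
        System.out.println("xxe passes:   " + isWellFormedWithoutDoctype(xxe));
    }
}
```

A check like this only covers one class of problem on one parser; it says nothing about scripts, macros, or disguised file types, which is where a dedicated scanner comes in.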

Thoroughly Investigating XML Files with an API

A dynamic document scanning API can help here. It inspects uploaded XML files for known attack patterns, dangerous entities, and invalid formatting, flagging issues before they touch the rest of your backend. With the code examples provided below, the integration is straightforward and adds real protection against a deceptively risky file type.

To implement this in our Maven project, we’ll first add the following repository reference to our pom.xml:

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

Then we’ll add the following dependency reference to our pom.xml:

<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>

Next, we’ll add the imports to the top of our file:

// Import classes:
import java.io.File;
import com.cloudmersive.client.invoker.ApiClient;
import com.cloudmersive.client.invoker.ApiException;
import com.cloudmersive.client.invoker.Configuration;
import com.cloudmersive.client.invoker.auth.*;
import com.cloudmersive.client.ScanApi;
import com.cloudmersive.client.model.VirusScanAdvancedResult;

With these initial steps out of the way, we’ll structure the rest of our API call. We’ll initialize the client, set up authorization via API key (we can get a free key on the Cloudmersive website with 800 API calls/month), configure our own custom scan options, and scan files using the scanFileAdvanced method:

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

ScanApi apiInstance = new ScanApi();
File inputFile = new File("/path/to/inputfile"); // File | Input file to perform the operation on.
// Every allow* flag defaults to false (recommended); set one to true only to deliberately permit that content type.
Boolean allowExecutables = false; // Boolean | Blocks executable files (program code) in the input file.
Boolean allowInvalidFiles = false; // Boolean | Blocks invalid files, such as a PDF that is not really a valid PDF, or a Word document that is not a valid Word document.
Boolean allowScripts = false; // Boolean | Blocks script files, such as PHP files and Python scripts, and other malicious content that can be embedded in the file.
Boolean allowPasswordProtectedFiles = false; // Boolean | Blocks password-protected and encrypted files, such as encrypted zip and rar files, that seek to circumvent scanning.
Boolean allowMacros = false; // Boolean | Blocks macros and other embedded content threats in document files, such as Word, Excel, and PowerPoint macros.
Boolean allowXmlExternalEntities = false; // Boolean | Blocks XML External Entities and other threats embedded in XML files.
Boolean allowInsecureDeserialization = false; // Boolean | Blocks insecure deserialization and other threats embedded in JSON and other object serialization files.
Boolean allowHtml = false; // Boolean | Blocks HTML input in the top-level file; HTML can contain XSS, scripts, local file accesses, and other threats. For API keys created prior to the release of this feature, the default is true for backward compatibility.
String restrictFileTypes = ".xml"; // String | Comma-separated list of file formats to allow as clean; for example, ".pdf,.docx,.png" would allow only PDF, Word, and PNG files. Files that fail content verification against this list return CleanResult=false. Set to null or an empty string to disable (the default).
try {
    VirusScanAdvancedResult result = apiInstance.scanFileAdvanced(inputFile, allowExecutables, allowInvalidFiles, allowScripts, allowPasswordProtectedFiles, allowMacros, allowXmlExternalEntities, allowInsecureDeserialization, allowHtml, restrictFileTypes);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ScanApi#scanFileAdvanced");
    e.printStackTrace();
}
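The "YOUR API KEY" placeholder is fine for a first test, but a key committed to source control is a risk in its own right. One option, sketched here under assumptions, is to read the key from an environment variable and pass the result to setApiKey. The variable name `CLOUDMERSIVE_API_KEY` and the `ApiKeyResolver` class are hypothetical; nothing in the Cloudmersive client requires them.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: looks up the API key outside of source code.
class ApiKeyResolver {

    // Assumed variable name; choose whatever fits your deployment.
    static final String ENV_NAME = "CLOUDMERSIVE_API_KEY";

    // Returns the trimmed key, or empty if it is missing or blank.
    static Optional<String> resolve(Map<String, String> env) {
        String value = env.get(ENV_NAME);
        if (value == null || value.isBlank()) {
            return Optional.empty();
        }
        return Optional.of(value.trim());
    }

    public static void main(String[] args) {
        Optional<String> key = resolve(System.getenv());
        if (key.isPresent()) {
            // In the article's snippet this value would go to Apikey.setApiKey(...).
            System.out.println("API key found (" + key.get().length() + " characters)");
        } else {
            System.out.println("Set " + ENV_NAME + " before running the scan");
        }
    }
}
```

Taking the environment as a `Map` parameter keeps the lookup easy to unit test without touching the real process environment.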

At this point, we’re up and running. Each API call returns a response structured like the example below. If an XML upload carries XML external entities and allowXmlExternalEntities is set to false, for example, we can expect “ContainsXmlExternalEntities” to come back true, which in turn produces a “CleanResult”: false response.

{
    "CleanResult": false,
    "ContainsExecutable": false,
    "ContainsInvalidFile": false,
    "ContainsScript": false,
    "ContainsPasswordProtectedFile": false,
    "ContainsRestrictedFileFormat": false,
    "ContainsMacros": false,
    "ContainsXmlExternalEntities": true,
    "ContainsInsecureDeserialization": false,
    "ContainsHtml": false,
    "ContainsUnsafeArchive": false,
    "ContainsOleEmbeddedObject": false,
    "VerifiedFileFormat": ".xml",
    "FoundViruses": null,
    "ContentInformation": null
}
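Printing the result is enough for a demo, but an upload handler usually needs a decision. The sketch below shows one way to turn a response like the one above into a list of rejection reasons. It deliberately avoids the client library: `ScanVerdict` and its nested `Flags` class are hypothetical stand-ins that mirror only a few of the response fields, and the policy of accepting only ".xml" as the verified format is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical decision helper built on a few fields from the scan response.
class ScanVerdict {

    // Plain holder mirroring a subset of the JSON fields shown in the article.
    static final class Flags {
        final boolean cleanResult;
        final boolean containsXmlExternalEntities;
        final boolean containsInvalidFile;
        final boolean containsScript;
        final String verifiedFileFormat;

        Flags(boolean cleanResult, boolean containsXmlExternalEntities,
              boolean containsInvalidFile, boolean containsScript,
              String verifiedFileFormat) {
            this.cleanResult = cleanResult;
            this.containsXmlExternalEntities = containsXmlExternalEntities;
            this.containsInvalidFile = containsInvalidFile;
            this.containsScript = containsScript;
            this.verifiedFileFormat = verifiedFileFormat;
        }
    }

    // Returns an empty list when the upload can be accepted.
    static List<String> rejectionReasons(Flags f) {
        List<String> reasons = new ArrayList<>();
        if (f.containsXmlExternalEntities) reasons.add("XML external entities");
        if (f.containsInvalidFile) reasons.add("invalid file");
        if (f.containsScript) reasons.add("embedded script");
        if (!".xml".equals(f.verifiedFileFormat)) {
            reasons.add("unexpected format: " + f.verifiedFileFormat);
        }
        // Fall back to the overall verdict when no specific flag explains it.
        if (!f.cleanResult && reasons.isEmpty()) {
            reasons.add("scanner reported CleanResult=false");
        }
        return reasons;
    }

    public static void main(String[] args) {
        // Values copied from the example response above.
        Flags sample = new Flags(false, true, false, false, ".xml");
        System.out.println(rejectionReasons(sample));
    }
}
```

Collecting reasons rather than returning a single boolean makes it easier to log why an upload was refused and to show the uploader a useful message.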

Written by Cloudmersive