How to convert any Office document file into plain text in Python
Office document formats are great for editing, but they can become unwieldy later in a workflow. Often the simplest fix is to take incompatibility and file size out of the equation entirely by converting these files to plain text. Let’s look at how to do this quickly and easily.
First, install the API client with the following command:
pip install cloudmersive-convert-api-client
Next, set up the function call as shown in the example below. It uses an API instance, which in turn requires an API key. Then enter your file path and, optionally, specify how whitespace should be handled.
from __future__ import print_function
import time
import cloudmersive_convert_api_client
from cloudmersive_convert_api_client.rest import ApiException
from pprint import pprint

# Configure API key authorization: Apikey
configuration = cloudmersive_convert_api_client.Configuration()
configuration.api_key['Apikey'] = 'YOUR_API_KEY'
# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['Apikey'] = 'Bearer'

# Create an instance of the API class
api_instance = cloudmersive_convert_api_client.ConvertDocumentApi(cloudmersive_convert_api_client.ApiClient(configuration))
input_file = '/path/to/file'  # file | Input file to perform the operation on.

# str | Optional; specify how whitespace should be handled when converting the
# document to text. Possible values are 'preserveWhitespace', which will attempt
# to preserve whitespace in the document and relative positioning of text within
# the document, and 'minimizeWhitespace', which will not insert additional spaces
# into the document in most cases. Default is 'preserveWhitespace'.
text_formatting_mode = 'preserveWhitespace'

try:
    # Convert Document to Text (txt)
    api_response = api_instance.convert_document_autodetect_to_txt(input_file, text_formatting_mode=text_formatting_mode)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ConvertDocumentApi->convert_document_autodetect_to_txt: %s\n" % e)
Now run it and you’re done! The API automatically detects your file format and returns the TXT version of the document. It covers many popular formats, including DOCX, XLSX, PPTX, and PDF, as well as legacy versions of those formats.
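Once you have a response, you will probably want the text in a file rather than printed to the console. Here is a minimal sketch of one way to do that. The helper name save_text_result is hypothetical, and the sketch assumes the response object exposes successful and text_result attributes; confirm both against the client's model documentation before relying on them. A stand-in object replaces the real API response so the example runs without an API key.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_text_result(api_response, out_path):
    """Write the converted text from a response object to out_path.

    Assumption: the response has `successful` and `text_result`
    attributes. This helper is illustrative, not part of the client.
    """
    if not getattr(api_response, "successful", False):
        raise ValueError("Conversion did not report success")
    # Fall back to an empty string if no text came back.
    text = getattr(api_response, "text_result", None) or ""
    path = Path(out_path)
    path.write_text(text, encoding="utf-8")
    return path


# Stand-in for a real API response so the sketch runs offline.
fake_response = SimpleNamespace(
    successful=True,
    text_result="Hello from a converted document.",
)

out_file = Path(tempfile.gettempdir()) / "converted.txt"
saved = save_text_result(fake_response, out_file)
print(saved.read_text(encoding="utf-8"))
```

In real use, pass the api_response returned by convert_document_autodetect_to_txt in place of fake_response, and choose an output path next to your source document.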