Converting DOCX to PDF in Python

PDF and Word (.doc or .docx) are among the most common document formats you will handle in a Python application. You can convert Word documents to PDF using several methods, each with its own advantages and trade-offs.

Here are three ways:

Using docx2pdf
Using unoconv with LibreOffice/OpenOffice
Using python-docx and reportlab

For this tutorial, I will use a sample.docx file that looks like this:

Method 1: Using docx2pdf

A popular package for converting a .docx file to PDF is “docx2pdf”. It works by automating Microsoft Word, which must be installed. On Windows, it drives Word through the COM interface, ensuring high fidelity, and on macOS, it uses JXA (JavaScript for Automation). Linux is not supported, because the package depends on Microsoft Word.

Flowchart of docx2pdf process

You can install the package using pip:

pip install docx2pdf

Here is the complete code:

from docx2pdf import convert

# Convert a single file
convert("./sample.docx", "./document.pdf")

print("Conversion from DOCX to PDF completed successfully!")

Output

Complexities

Time Complexity: O(n), where n is the document’s size, plus the startup time of Microsoft Word.
Space Complexity: O(n), where n is the document’s size, plus the memory used by Microsoft Word.

Pros

It offers both a command-line interface and a Python API (which we used here).
It is not limited to a single file; it can also convert an entire directory of .docx files.
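
The directory conversion mentioned above can be sketched as follows. This is a minimal example, assuming that docx2pdf’s convert function accepts a folder path and writes each PDF next to its source file. The list_docx_files helper and the "./reports" folder name are hypothetical and only serve to preview which files would be picked up.

```python
from pathlib import Path


def list_docx_files(folder):
    """Return the .docx files in a folder, sorted by name (hypothetical helper)."""
    return sorted(p for p in Path(folder).glob("*.docx") if p.is_file())


def convert_folder(folder):
    """Convert every .docx file in `folder`; PDFs land next to the sources."""
    files = list_docx_files(folder)
    if not files:
        print(f"No .docx files found in {folder}")
        return files

    # Imported lazily so the helper above works even without docx2pdf installed
    from docx2pdf import convert

    convert(str(folder))  # passing a directory converts the .docx files inside it
    return files


# Usage ("./reports" is a placeholder folder name):
# converted = convert_folder("./reports")
# print(f"Converted {len(converted)} file(s)")
```

Listing the files first makes it easy to log or filter what will be converted before Word is started.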

Cons

The library does not work standalone. It requires Microsoft Word to be installed and launches it during the conversion.
It provides limited customization.

Specific usages

If you are working on a Windows system or server with Microsoft Word installed, this library is a good fit because it drives Word directly.
If you want to convert using a command-line tool, this is the approach you should go for.

Method 2: Using unoconv with LibreOffice/OpenOffice

unoconv is a command-line tool that uses LibreOffice’s UNO bindings to convert documents. It supports a wide range of formats and preserves formatting well.

You need to install LibreOffice to use this approach. Since I am using macOS, I can install it with Homebrew using the commands below:

# Install LibreOffice
brew install --cask libreoffice

# Install unoconv
brew install unoconv

Make sure that you have installed the Homebrew package manager for Mac.

Flowchart of the unoconv process

Here is the complete code:

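import subprocess


def convert_with_unoconv(input_file, output_file):
    # check=True raises CalledProcessError if unoconv exits with an error,
    # so the success message below is only printed when the conversion worked
    subprocess.run(
        ['unoconv', '-f', 'pdf', '-o', output_file, input_file],
        check=True,
    )


# Usage
convert_with_unoconv('./sample.docx', './document.pdf')
print("DOCX file converted to PDF")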

Output

If everything is installed and set up correctly, you should be able to convert the DOCX file into a PDF file.

Complexities

Time Complexity: O(n), where n is the size and complexity of the DOCX file, plus the time LibreOffice takes to start up.
Space Complexity: O(n), where n is the size of the document, plus the memory LibreOffice itself uses.

Pros

LibreOffice can convert between many document formats, so this approach is not limited to DOCX and PDF.
It preserves the formatting of the DOCX file with high fidelity.
It is automation-friendly, so you can script it for batch processing.
It is platform-independent: you can use it on Windows, Linux, or macOS.

Cons

It requires a full installation of LibreOffice or OpenOffice, which can be heavy for your system.
It is not suitable for real-time conversions, because LibreOffice is started every time you convert a file, which makes each conversion slow.

Specific usages

When you want to convert multiple files in a batch.
When you need to process a large number of Word files in an automated environment.
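
The batch scenario above can be sketched as follows. This is a minimal example, not a definitive implementation: it assumes unoconv accepts several input files in one call and treats the -o argument as an output directory in that case. The build_unoconv_command and convert_batch names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_unoconv_command(input_files, output_dir):
    """Build one unoconv call that converts several files to PDF."""
    # Assumption: with several inputs, -o names the output directory
    return ["unoconv", "-f", "pdf", "-o", str(output_dir), *map(str, input_files)]


def convert_batch(input_dir, output_dir):
    """Convert every .docx file in input_dir with a single unoconv invocation."""
    if shutil.which("unoconv") is None:
        raise RuntimeError("unoconv was not found on PATH")
    files = sorted(Path(input_dir).glob("*.docx"))
    if not files:
        return []
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the conversion fails
    subprocess.run(build_unoconv_command(files, output_dir), check=True)
    return files


# Usage (folder names are placeholders):
# convert_batch("./docs", "./pdfs")
```

Passing all files to one invocation should avoid paying the LibreOffice startup cost once per file.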

Method 3: Using python-docx and reportlab

The python-docx and reportlab libraries can be combined to do the job: “python-docx” reads the DOCX file, and “reportlab” generates a PDF from the extracted content.

Here is the step-by-step implementation:

Read the .docx file using the python-docx library.
Extract the content from the DOCX file (the code in this section extracts paragraph text only).
Generate a PDF using reportlab by adding the extracted content.

Flowchart of python-docx and reportlab process

Install both libraries using the command below:

pip install python-docx reportlab

Here is the complete Python code:

from docx import Document
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas


def convert_with_python_docx(input_file, output_file):
    # Read the .docx file
    doc = Document(input_file)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)

    # Generate PDF
    pdf = canvas.Canvas(output_file, pagesize=letter)
    textobject = pdf.beginText(40, 750)
    for line in full_text:
        textobject.textLine(line)
        if textobject.getY() < 50:
            pdf.drawText(textobject)
            pdf.showPage()
            textobject = pdf.beginText(40, 750)
    pdf.drawText(textobject)
    pdf.save()


# Usage
convert_with_python_docx('./sample.docx', './document.pdf')
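
One limitation of the code above is that textLine does not wrap text, so a long paragraph runs off the right edge of the page. A minimal sketch of one workaround is to pre-wrap the paragraphs with the standard textwrap module. The wrap_paragraphs helper is hypothetical, and max_chars=90 is only a rough guess for the default font on a letter page; wrapping by character count is an approximation because the font is proportional.

```python
import textwrap


def wrap_paragraphs(paragraphs, max_chars=90):
    """Split each paragraph into lines of at most max_chars characters."""
    lines = []
    for text in paragraphs:
        # textwrap.wrap returns [] for empty text, so keep a blank line instead
        lines.extend(textwrap.wrap(text, width=max_chars) or [""])
    return lines


# Usage inside convert_with_python_docx:
# for line in wrap_paragraphs(full_text):
#     textobject.textLine(line)
```

Because the page-break check runs after every line, the wrapped lines still flow onto new pages as before.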

Output

Complexities

Time Complexity: O(n + m), where n is the size of the DOCX content being read and m is the size of the PDF content being generated.
Space Complexity: O(n + m). Both libraries work in memory: n is the space needed to hold the content read from the DOCX file, and m is the space needed to build the PDF page by page.

Pros

This approach does not require a full-fledged application like LibreOffice or MS Word on your machine. It is a pure Python approach: you install the libraries and start coding.
It offers fine-grained control over the PDF output.

Cons

It may not work well when the document contains images or complex structures.
It requires significant programming knowledge to map DOCX elements to PDF.

Specific usages

When you need to customize an output PDF file and high fidelity is not required.
If you need to insert dynamic data into the output, this approach is a good fit.
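
As a minimal sketch of dynamic data insertion, the text lines read from the DOCX file can be filled in before they are drawn onto the PDF. This assumes the document uses $name-style placeholders, a convention chosen here purely for illustration and not something python-docx or reportlab provides. The fill_placeholders helper is hypothetical.

```python
from string import Template


def fill_placeholders(lines, values):
    """Replace $name placeholders in each line; unknown names are left as-is."""
    return [Template(line).safe_substitute(values) for line in lines]


# Usage with the paragraph text collected by python-docx:
# full_text = fill_placeholders(full_text, {"name": "Ada", "amount": "42.00"})
```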

Conversion time comparison for the different approaches

The conversion time for each method is shown in the following image:

I used the “time” module to measure the processing time of each method and summarized my observations below:

The fastest method is “python-docx and reportlab”
The second fastest is “docx2pdf”
The slowest method is “unoconv with LibreOffice”
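
The exact measurement code is not shown here, but a minimal sketch of timing a conversion with the time module could look like this. The time_call helper is hypothetical, and a single run gives only a rough figure, so repeating the measurement a few times is advisable.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage with one of the converters defined earlier:
# _, seconds = time_call(convert_with_python_docx, "./sample.docx", "./document.pdf")
# print(f"python-docx + reportlab: {seconds:.2f} s")
```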

Here is the chart based on the above observation:

That’s it!

Krunal Lathiya

With a career spanning over eight years in the field of Computer Science, Krunal’s expertise is rooted in a solid foundation of hands-on experience, complemented by a continuous pursuit of knowledge.