PDF (.pdf) and Word (.doc or .docx) are among the most common document formats you will handle in a Python application. You can convert Word documents to PDF using several methods, each with its own advantages and trade-offs.
Here are three ways:
- Using docx2pdf
- Using unoconv with LibreOffice/OpenOffice
- Using python-docx and reportlab
For this tutorial, I will use a sample.docx file that looks like this:
Method 1: Using docx2pdf
One of the most popular packages for converting a .docx file to PDF is “docx2pdf”. The package does not perform the conversion itself; it drives an installed copy of Microsoft Word. On Windows, it uses Word through the COM interface, which gives high-fidelity output. On macOS, it uses JXA (JavaScript for Automation) to script Word. Linux is not supported, because the package depends on Microsoft Word and does not fall back to LibreOffice.
Flowchart of docx2pdf process
You can install the package using pip:
pip install docx2pdf
Here is the complete code:
from docx2pdf import convert

# Convert a single file
convert("./sample.docx", "./document.pdf")
print("Conversion from DOCX to PDF completed successfully!")
Output
Complexities
- Time Complexity: Roughly O(n), where n is the document’s size, plus the fixed time needed to start Microsoft Word.
- Space Complexity: Roughly O(n), where n is the document’s size, plus the memory used by Microsoft Word itself.
Pros
- It offers both a command-line tool and a Python API (which we used here).
- It is not limited to converting a single file; it can convert an entire directory of .docx files.
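As a rough illustration of the directory conversion mentioned above, here is a minimal sketch. The helper name `convert_folder` and its `converter` parameter are hypothetical and not part of docx2pdf; they only wrap the package's `convert` function, which accepts a folder path. The sketch also assumes that, when no output path is given, each PDF is written next to its source file.

```python
from pathlib import Path


def convert_folder(folder, converter=None):
    """Convert every .docx file in `folder` and return the expected PDF paths."""
    folder = Path(folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"{folder} is not a directory")

    if converter is None:
        # Imported lazily so the helper can be exercised without Word installed.
        from docx2pdf import convert as converter

    docx_files = sorted(folder.glob("*.docx"))
    if docx_files:
        # Passing a directory asks docx2pdf to convert the .docx files inside it.
        converter(str(folder))

    # Assumption: with no output path, each PDF is written next to its source file.
    return [path.with_suffix(".pdf") for path in docx_files]
```

Calling `convert_folder("./reports")` would then convert the folder in one Word session instead of invoking the converter once per file.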
Cons
- The library does not work standalone. It requires Microsoft Word to be installed and has to launch it for each conversion.
- It provides limited customization.
Specific usages
- If you are working on a Windows machine or server that has Microsoft Word installed, this library is a good fit.
- If you want to convert using a command-line tool, this is the approach you should go for.
Method 2: Using unoconv with LibreOffice/OpenOffice
unoconv is a command-line tool that uses LibreOffice’s UNO bindings to convert between document formats. It supports a wide range of formats and preserves formatting well.
You need to install LibreOffice to use this approach. Since I am using macOS, I can install it using the commands below:
# Install LibreOffice
brew install --cask libreoffice

# Install unoconv
brew install unoconv
Make sure that you have installed the Homebrew package manager for Mac.
Flowchart of the unoconv process
Here is the complete code:
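import subprocess

def convert_with_unoconv(input_file, output_file):
    # check=True raises CalledProcessError if unoconv exits with an error,
    # so the success message below is only printed when the conversion worked
    subprocess.run(['unoconv', '-f', 'pdf', '-o', output_file, input_file], check=True)

# Usage
convert_with_unoconv('./sample.docx', './document.pdf')
print("DOCX file is converted into PDF")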
Output
If everything is installed and set up correctly, the .docx file will be converted into a PDF file.
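To check the setup before converting, a small pre-flight test can confirm that the `unoconv` command is on your PATH. This is only a sketch: the helper name `missing_tools` and its parameters are hypothetical, and it checks that the command exists, not that LibreOffice itself is working.

```python
import shutil


def missing_tools(names=("unoconv",), which=shutil.which):
    """Return the command names from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("unoconv is available")
```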
Complexities
- Time Complexity: Roughly O(n), where n is the size and complexity of the .docx file. LibreOffice also has to start up for the conversion, so that time is part of the total.
- Space Complexity: Roughly O(n), where n is the size of the document, plus the memory used by LibreOffice itself.
Pros
- The LibreOffice approach can convert between the many formats that LibreOffice supports and is not limited to DOCX and PDF.
- It preserves the formatting of the .docx file with high fidelity.
- It is easy to script for batch processing, which makes it automation-friendly.
- It is cross-platform: you can use this approach on Windows, Linux, or macOS.
Cons
- It requires a full installation of LibreOffice or OpenOffice, which can be heavy for your system.
- It is not well suited to real-time conversions, because LibreOffice is started every time you convert a file, which makes each conversion slow.
Specific usages
- When you want to convert multiple files in a batch.
- When you need to process a large number of Word files in an automated environment.
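A batch run like the ones described above could be sketched as follows. The function name `batch_convert` and its `runner` parameter are hypothetical; the sketch only reuses the same `unoconv -f pdf -o` command shown earlier, once per file, and records failures instead of stopping.

```python
import subprocess
from pathlib import Path


def batch_convert(input_dir, output_dir, runner=subprocess.run):
    """Convert each .docx file in input_dir to a PDF in output_dir using unoconv."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    converted, failed = [], []
    for source in sorted(input_dir.glob("*.docx")):
        target = output_dir / (source.stem + ".pdf")
        command = ["unoconv", "-f", "pdf", "-o", str(target), str(source)]
        try:
            runner(command, check=True)
            converted.append(target)
        except (subprocess.CalledProcessError, FileNotFoundError):
            # Keep going so one bad file does not stop the whole batch.
            failed.append(source)
    return converted, failed
```

Returning both lists lets the caller retry or report the failed files after the batch finishes.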
Method 3: Using python-docx and reportlab
The python-docx and reportlab libraries can do the job together: “python-docx” reads the content of the DOCX file, and “reportlab” generates a PDF from that content.
Here is the step-by-step implementation:
- Read the .docx file using the python-docx library.
- Extract the content from the .docx file (the example below extracts paragraph text only, not images).
- Generate a PDF using reportlab by adding the extracted content.
Flowchart of python-docx and reportlab process
Install both libraries using the command below:
pip install python-docx reportlab
Here is the complete Python code:
from docx import Document
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def convert_with_python_docx(input_file, output_file):
    # Read the .docx file
    doc = Document(input_file)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)

    # Generate PDF
    pdf = canvas.Canvas(output_file, pagesize=letter)
    textobject = pdf.beginText(40, 750)
    for line in full_text:
        textobject.textLine(line)
        if textobject.getY() < 50:
            pdf.drawText(textobject)
            pdf.showPage()
            textobject = pdf.beginText(40, 750)
    pdf.drawText(textobject)
    pdf.save()

# Usage
convert_with_python_docx('./sample.docx', './document.pdf')
Output
Complexities
- Time Complexity: O(n + m), where n is the size of the DOCX content that is read and m is the size of the PDF content that is generated from it.
- Space Complexity: O(n + m). Both libraries work in memory, so n is the space needed to hold the content read from the DOCX file and m is the space needed to build the PDF page by page.
Pros
- This approach does not require fully-fledged applications like LibreOffice or MS Word installed on your machine. It is a pure Python library-based approach where you need to install libraries and start coding.
- It offers fine-grained control over the PDF output.
Cons
- It may not work well when the document contains images, tables, or complex formatting; the example above copies paragraph text only.
- It requires significant programming knowledge to map DOCX elements to PDF.
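To give a sense of what mapping DOCX elements to PDF involves, here is a minimal sketch that maps Word paragraph style names to styles from reportlab's sample stylesheet and lets reportlab's platypus layout engine wrap long lines. The names `STYLE_MAP`, `pick_style_name`, and `convert_with_styles` are hypothetical, and the mapping covers only a few built-in English style names; fonts, tables, and images are still ignored.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from Word style names to reportlab sample style names.
STYLE_MAP = {
    "Title": "Title",
    "Heading 1": "Heading1",
    "Heading 2": "Heading2",
    "Heading 3": "Heading3",
}


def pick_style_name(docx_style_name):
    """Return the reportlab sample style to use for a Word paragraph style."""
    return STYLE_MAP.get(docx_style_name, "BodyText")


def convert_with_styles(input_file, output_file):
    # Third-party imports are local so the mapping helper works without them.
    from docx import Document
    from reportlab.lib.pagesizes import letter
    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.platypus import Paragraph, SimpleDocTemplate

    styles = getSampleStyleSheet()
    story = []
    for para in Document(input_file).paragraphs:
        if not para.text.strip():
            continue  # skip empty paragraphs
        style = styles[pick_style_name(para.style.name)]
        # Paragraph parses XML-like markup, so escape &, < and > in the text.
        story.append(Paragraph(escape(para.text), style))

    SimpleDocTemplate(output_file, pagesize=letter).build(story)
```

Every additional element type (lists, tables, images, inline bold or italic runs) would need its own mapping code, which is where most of the programming effort goes.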
Specific usages
- When you need to customize an output PDF file and high fidelity is not required.
- If you are dealing with dynamic data insertions then this approach is for you.
Conversion time comparison for the different approaches
The conversion time for each method is shown in the following image:
I used Python’s “time” module to measure the processing time of each method and summarized my observations:
- The fastest method is “python-docx and reportlab”
- The second fastest is “docx2pdf”
- The slowest method is “unoconv with LibreOffice”
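The exact timing script is not shown here, so the following is only a sketch of how such a measurement could be taken with the time module. The helper name `time_conversion` is hypothetical, and a single run includes application startup time, so repeated runs would give a fairer picture.

```python
import time


def time_conversion(convert, *args):
    """Run convert(*args) once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    convert(*args)
    return time.perf_counter() - start


# Example usage with one of the functions defined earlier:
# elapsed = time_conversion(convert_with_unoconv, "./sample.docx", "./document.pdf")
# print(f"unoconv took {elapsed:.2f} seconds")
```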
Here is the chart based on the above observation:
That’s it!