How to Crop PDF Files with Python

Many scanned documents have unwanted borders or margins that waste space on the page. Cropping removes them so the document focuses on the content and is easier to read. Be careful, though: if you crop without measuring the page first, you can cut off the main content and make the document hard to read.

Here are four ways to crop PDF files in Python:

Using PyPDF2
Using PyMuPDF (fitz)
Using pdf2image
Using pdfrw

Before going further, here is the PDF we will work with in this tutorial.

The file “simple.pdf” sits in my current working directory:

Method 1: Using PyPDF2

PyPDF2 is a popular library for manipulating PDF files in Python. The approach is simple: read the PDF, adjust the media box of each page to crop it, and write the modified pages to a new PDF file. This changes only the page boundaries; content outside the box is hidden from view, not deleted from the file.

from PyPDF2 import PdfReader, PdfWriter


def crop_pdf_pypdf2(input_path, output_path, crop_box):
    reader = PdfReader(input_path)
    writer = PdfWriter()

    for page in reader.pages:
        page.mediabox.lower_left = (crop_box[0], crop_box[1])
        page.mediabox.upper_right = (crop_box[2], crop_box[3])
        writer.add_page(page)

    with open(output_path, "wb") as output_file:
        writer.write(output_file)


crop_pdf_pypdf2("./simple.pdf", "./cropped.pdf", (30, 30, 570, 840))

Output before cropping

Output after cropping

You can see from the two images above that we cropped the blank space from both sides, which puts the focus on the content.

(30, 30, 570, 840): This tuple holds the crop box coordinates in PDF points (1 point = 1/72 inch), measured from the bottom-left corner of the page. It specifies the region of the page to keep:

30: The distance from the left edge of the page to the left edge of the crop box.
30: The distance from the bottom edge of the page to the bottom edge of the crop box.
570: The distance from the left edge of the page to the right edge of the crop box.
840: The distance from the bottom edge of the page to the top edge of the crop box.
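
Instead of typing these four numbers by hand, you can derive them from the margins you want to trim. The following is a minimal sketch, assuming an A4 page of 595 x 842 points; the helper name crop_box_from_margins and its parameters are hypothetical and are not part of PyPDF2.

```python
def crop_box_from_margins(page_width, page_height, left=0, bottom=0, right=0, top=0):
    """Return (left, bottom, right, top) in PDF points, measured from the
    bottom-left corner of the page, after trimming the given margins."""
    x0, y0 = left, bottom
    x1, y1 = page_width - right, page_height - top
    # Refuse margins that would leave nothing (or a negative area) to keep
    if x0 >= x1 or y0 >= y1:
        raise ValueError("margins remove the whole page")
    return (x0, y0, x1, y1)


# Assumed page size: A4 in points (595 x 842). Trim 30 points from every side.
box = crop_box_from_margins(595, 842, left=30, bottom=30, right=30, top=30)
print(box)
```

The returned tuple has the same shape as the crop_box argument used above, so it could be passed straight to crop_pdf_pypdf2().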

Complexities

Time Complexity: O(n), where n is the number of pages in the PDF.
Space Complexity: O(n) because it loads the entire PDF into memory.

Pros

It is very easy to use.
It can handle various PDF versions.

Cons

It may not handle all PDF features effortlessly.

Specific usage

You can use it for simple cropping tasks.
It is suitable for small to medium-sized PDFs.

Method 2: Using PyMuPDF (fitz)

PyMuPDF is a high-performance library that crops PDFs efficiently by setting the crop box of each page directly.

Install the PyMuPDF library using the command below:

pip install pymupdf

The PyMuPDF library can be imported as “fitz” like this:

import fitz

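It provides the page method .set_cropbox() and the fitz.Rect class. Calling fitz.Rect(crop_box) builds a rectangle from a tuple of four coordinates that marks the region to keep. Unlike the PDF convention used in Method 1, PyMuPDF measures coordinates from the top-left corner of the page, so the tuple is read as (left, top, right, bottom), and the rectangle must lie inside the page's media box.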

I highly recommend using this approach as it is very simple and more memory efficient.

import fitz


def cropping_pdf_pymupdf(input_path, output_path, crop_box):
    doc = fitz.open(input_path)
    for page in doc:
        page.set_cropbox(fitz.Rect(crop_box))
    doc.save(output_path)
    doc.close()


cropping_pdf_pymupdf("./simple.pdf",
                 "./cropped_fitz.pdf", (30, 30, 570, 840))

print("Cropped Successfully!")
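
PyMuPDF documents its coordinates as starting at the top-left corner of the page, whereas the tuple in Method 1 is measured from the bottom-left. Here is a small sketch of converting between the two, assuming an unrotated page whose height in points is known. The function name to_top_left_origin is hypothetical and is not part of PyMuPDF.

```python
def to_top_left_origin(box, page_height):
    """Convert (left, bottom, right, top), measured from the bottom-left corner,
    into (left, top, right, bottom), measured from the top-left corner."""
    left, bottom, right, top = box
    # Only the vertical values change: each is mirrored against the page height
    return (left, page_height - top, right, page_height - bottom)


# Assumed page height: 842 points (A4).
print(to_top_left_origin((30, 30, 570, 840), 842))
```

The converted tuple could then be wrapped in fitz.Rect() before calling page.set_cropbox().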

Output before cropping

Output after cropping

Complexities

Time complexity: O(n), where n is the number of PDF pages.
Space complexity: roughly O(1) in additional memory, because it processes one page at a time instead of holding every page in memory. In practice this tends to make it more memory efficient than the other methods in this article.

Pros

It crops PDFs that contain images and graphics well.
It uses less memory.
It can handle complex PDFs really well.

Cons

It depends on a compiled C extension rather than being pure Python.
It may have compatibility issues on systems where the environment is hard to set up.

Specific usage

If you are processing PDFs in batches, this is the best approach.
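
A batch run can be sketched as a loop over a folder that hands each file to a cropping function. This is only an illustration: crop_folder, the "_cropped" suffix, and the folder names are hypothetical, and crop_one stands for any function that takes an input path and an output path.

```python
from pathlib import Path


def crop_folder(input_dir, output_dir, crop_one):
    """Call crop_one(input_path, output_path) for every PDF in input_dir.

    Returns the list of output paths that were passed to crop_one.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() keeps the processing order predictable across runs
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        target = output_dir / f"{pdf_path.stem}_cropped.pdf"
        crop_one(str(pdf_path), str(target))
        written.append(target)
    return written


# Hypothetical usage with the function from this section:
# crop_folder("./in", "./out",
#             lambda src, dst: cropping_pdf_pymupdf(src, dst, (30, 30, 570, 840)))
```

Passing the cropping function as an argument keeps the loop independent of the library, so the same helper would work with any of the four methods.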

Method 3: Using pdf2image

In this approach, we convert each PDF page into an image using the pdf2image library, crop the images, and combine them back into a PDF.

Let’s use a different PDF for this approach: “image_based.pdf”.

Install the pdf2image library using the command below. It returns Pillow images, so Pillow is installed alongside it, and it also needs the Poppler utilities to be installed on your system:

pip install pdf2image

Now, you can use the library’s convert_from_path() function.

from pdf2image import convert_from_path


def crop_pdf_pdf2image(input_path, output_path, crop_box):
    # Render every page to a Pillow image (pdf2image uses 200 DPI by default)
    images = convert_from_path(input_path)

    # crop_box is in pixels, measured from the top-left corner of the image:
    # (left, upper, right, lower)
    cropped_images = [image.crop(crop_box) for image in images]

    # Save the first image as a PDF and append the remaining pages to it
    cropped_images[0].save(output_path, save_all=True, append_images=cropped_images[1:])


# Usage
crop_pdf_pdf2image("./image_based.pdf", "./img2pdf_crop.pdf", (30, 30, 570, 840))
print("Cropped Successfully!")

Output before cropping

Output after cropping

Complexities

Time Complexity: O(n * m), where n is the number of PDF pages and m is the pixel count of each page.
Space Complexity: O(n * m), as it converts all pages to images in memory.
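
To get a feel for the n * m term, you can estimate how much memory the rendered pages need before converting anything. This is a rough sketch that assumes uncompressed RGB bitmaps (3 bytes per pixel) and pdf2image's default of 200 DPI; the function name raster_memory_bytes is hypothetical.

```python
def raster_memory_bytes(n_pages, width_pts, height_pts, dpi=200, channels=3):
    """Rough size of n_pages uncompressed bitmaps, ignoring Python overhead."""
    width_px = round(width_pts * dpi / 72)    # 72 points per inch
    height_px = round(height_pts * dpi / 72)
    return n_pages * width_px * height_px * channels


# Assumed input: 50 A4 pages (595 x 842 points) at 200 DPI.
estimate = raster_memory_bytes(50, 595, 842)
print(f"{estimate / 1_000_000:.0f} MB")
```

The estimate grows with the square of the DPI, so halving the resolution cuts the memory to roughly a quarter.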

Pros

It works well for PDFs made up of images and graphics, such as scans.
It provides more accurate cropping, especially for complex PDFs.

Cons

It uses more memory and processing time because every page is rasterized to pixels and then converted back into a PDF.
Text-heavy PDFs can lose quality, because the text is turned into an image and is no longer selectable or searchable.

Specific usage

Good when precise cropping is required because it works at pixel level.
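
Because this method works in pixels, a box measured in PDF points has to be scaled by the rendering resolution and flipped vertically before it is passed to Pillow's crop(). Below is a minimal sketch, assuming the page height in points is known and the page was rendered at 200 DPI (the pdf2image default); the function name points_box_to_pixels is hypothetical.

```python
def points_box_to_pixels(box, page_height_pts, dpi=200):
    """Convert a (left, bottom, right, top) box in PDF points (bottom-left origin)
    into a (left, upper, right, lower) pixel box (top-left origin), which is the
    order Pillow's Image.crop() expects."""
    scale = dpi / 72  # pixels per point
    left, bottom, right, top = box
    return (
        round(left * scale),
        round((page_height_pts - top) * scale),
        round(right * scale),
        round((page_height_pts - bottom) * scale),
    )


# Assumed: an 842-point-tall page (A4) rendered at 200 DPI.
print(points_box_to_pixels((30, 30, 570, 840), 842))
```

The resulting tuple could be used as the crop_box argument of crop_pdf_pdf2image() so that the pixel crop matches the region chosen in points.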

Method 4: Using pdfrw

pdfrw is another pure Python PDF library that crops PDFs by modifying the MediaBox of each page. It is lightweight, simple, and good for basic PDF operations.

You can install pdfrw using the command below:

pip install pdfrw

Then set the MediaBox of each page and write the result to a new file:

from pdfrw import PdfReader, PdfWriter


def crop_pdf_pdfrw(input_path, output_path, crop_box):
    reader = PdfReader(input_path)
    writer = PdfWriter()

    for page in reader.pages:
        page.MediaBox = [float(x) for x in crop_box]
        writer.addpage(page)

    writer.write(output_path)


# Usage
input_pdf = "./simple.pdf"
output_pdf = "./cropped_pdfrw.pdf"
crop_box = (30, 30, 570, 840) # (left, bottom, right, top)

crop_pdf_pdfrw(input_pdf, output_pdf, crop_box)
print("Cropped Successfully!")

Output before cropping

Output after cropping

Complexities

Time Complexity: O(n), where n is the number of pages.
Space Complexity: O(n), as it loads the entire PDF into memory.

Pros

It is a good fit for basic PDF operations when a pure Python solution is required.

Cons

It may not handle all PDF versions or complex structures.

Conclusion

After explaining four approaches, here is my final verdict:

For simple, pure Python solutions, you can use PyPDF2 or pdfrw.
If you need high performance or are processing large PDFs, use PyMuPDF.
If your PDF contains complex layouts or pixel-level precision is crucial, use the pdf2image library.


Krunal Lathiya

With a career spanning over eight years in the field of Computer Science, Krunal’s expertise is rooted in a solid foundation of hands-on experience, complemented by a continuous pursuit of knowledge.