Many scanned documents have unwanted borders or margins that waste space on the page. Cropping lets us focus on the content, which makes the document easier to read. Be careful, though: if you crop without measuring the page first, you can cut off the main content as well, which ruins the document for readers.
Here are four ways to crop PDF files in Python:
- Using PyPDF2
- Using PyMuPDF (fitz)
- Using pdf2image
- Using pdfrw
Before going further, here is the PDF we are working with in today’s practical.
Here is our “simple.pdf” file in my current working directory:
Method 1: Using PyPDF2
PyPDF2 is a popular library for manipulating PDF files in Python. The way it works is very simple: it reads the PDF, adjusts the media box of each page to crop it, and then writes the modified pages into a new PDF file.
from PyPDF2 import PdfReader, PdfWriter

def crop_pdf_pypdf2(input_path, output_path, crop_box):
    reader = PdfReader(input_path)
    writer = PdfWriter()
    for page in reader.pages:
        # crop_box is (left, bottom, right, top) in PDF points
        page.mediabox.lower_left = (crop_box[0], crop_box[1])
        page.mediabox.upper_right = (crop_box[2], crop_box[3])
        writer.add_page(page)
    with open(output_path, "wb") as output_file:
        writer.write(output_file)

crop_pdf_pypdf2("./simple.pdf", "./cropped.pdf", (30, 30, 570, 840))
Output before cropping
Output after cropping
You can see from the two images above that we cropped the blank space from both sides, which keeps the page focused on the content.
(30, 30, 570, 840): This tuple represents the crop box coordinates. The values are in PDF points (1 point = 1/72 inch), measured from the bottom-left corner of the page, and they specify the region of the page to keep:
- 30: The distance from the left edge of the page to the left edge of the crop box.
- 30: The distance from the bottom edge of the page to the bottom edge of the crop box.
- 570: The distance from the left edge of the page to the right edge of the crop box.
- 840: The distance from the bottom edge of the page to the top edge of the crop box.
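To make the arithmetic behind these four numbers concrete, here is a minimal sketch that derives a crop box from the margins you want to trim. It assumes an A4 page of 595 x 842 points, which fits the sample values but is my assumption rather than something measured from the sample file. `crop_box_from_margins` is a hypothetical helper, not part of PyPDF2.

```python
def crop_box_from_margins(page_width, page_height, left=0, bottom=0, right=0, top=0):
    """Return (x0, y0, x1, y1) in PDF points, origin at the bottom-left corner."""
    x0, y0 = left, bottom
    x1, y1 = page_width - right, page_height - top
    if x0 >= x1 or y0 >= y1:
        raise ValueError("margins remove the whole page")
    return (x0, y0, x1, y1)


# Assumed A4 page: 595 x 842 points (1 point = 1/72 inch)
box = crop_box_from_margins(595, 842, left=30, bottom=30, right=25, top=2)
print(box)  # (30, 30, 570, 840)
```

Thinking in margins rather than absolute coordinates makes it harder to cut into the content by accident, because each number says how much is removed from one edge.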
Complexities
- Time Complexity: O(n), where n is the number of pages in the PDF.
- Space Complexity: O(n) because it loads the entire PDF into memory.
Pros
- It is very easy to use.
- It can handle various PDF versions.
Cons
- It may struggle with some PDF features, and shrinking the media box only hides the trimmed area; the content outside the box stays in the file.
Specific usage
- You can use it for simple cropping tasks.
- It is suitable for small to medium-sized PDFs.
Method 2: Using PyMuPDF (fitz)
PyMuPDF is a high-performance library that processes PDFs efficiently by setting the crop box of each page directly.
Install the PyMuPDF library using the command below:
pip install pymupdf
The PyMuPDF library is imported under the name “fitz”, like this:
import fitz
Each page object provides a .set_cropbox() method, and fitz.Rect(crop_box) builds the rectangle to pass to it, where crop_box is a tuple of four coordinates describing the region to keep.
I highly recommend using this approach as it is very simple and more memory efficient.
import fitz

def cropping_pdf_pymupdf(input_path, output_path, crop_box):
    doc = fitz.open(input_path)
    for page in doc:
        # Restrict the visible area of each page to the given rectangle
        page.set_cropbox(fitz.Rect(crop_box))
    doc.save(output_path)
    doc.close()

cropping_pdf_pymupdf("./simple.pdf", "./cropped_fitz.pdf", (30, 30, 570, 840))
print("Cropped Successfully!")
Output before cropping
Output after cropping
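One thing to watch: to my understanding, PyMuPDF measures page coordinates from the top-left corner with y growing downward, while the PDF convention used for the media box in Method 1 has its origin at the bottom-left. The same tuple may therefore trim different edges in the two libraries. The sketch below flips a bottom-left box into top-left form and checks that it fits on the page. `to_top_left` and `fits_inside` are hypothetical helpers, and the 595 x 842 page size is an assumed A4 value.

```python
def to_top_left(box, page_height):
    """Convert (left, bottom, right, top) with a bottom-left origin
    into (x0, y0, x1, y1) with a top-left origin."""
    left, bottom, right, top = box
    return (left, page_height - top, right, page_height - bottom)


def fits_inside(box, page_width, page_height):
    """True if the box is non-empty and lies within the page."""
    x0, y0, x1, y1 = box
    return 0 <= x0 < x1 <= page_width and 0 <= y0 < y1 <= page_height


flipped = to_top_left((30, 30, 570, 840), page_height=842)
print(flipped)                         # (30, 2, 570, 812)
print(fits_inside(flipped, 595, 842))  # True
```

The flipped tuple could then be passed as crop_box to cropping_pdf_pymupdf. Checking the box first is worthwhile because, as far as I know, set_cropbox() rejects rectangles that extend beyond the page's media box.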
Complexities
- Time complexity: O(n), where n is the number of PDF pages.
- Space complexity: O(1) in the number of pages, because it processes one page at a time rather than holding every page in memory at once. This makes it the most memory-efficient of the four approaches covered here.
Pros
- It crops well when the PDF contains images and graphics.
- It uses less memory.
- It can handle complex PDFs really well.
Cons
- It is built on a C extension, so it has to be compiled on platforms where no prebuilt package is available.
- It may have compatibility issues on systems where the environment is hard to set up.
Specific usage
- If you are processing PDFs in batches, this is the best approach.
Method 3: Using pdf2image
In this approach, we convert the PDF pages into images using the pdf2image library, crop the images, and combine them back into a PDF.
Let’s use a different PDF for this approach. We will use “image_based.pdf”.
Install the pdf2image library using the command below (it also needs Poppler installed on your system):
pip install pdf2image
You can import the function we need like this:
from pdf2image import convert_from_path
Now you can use the library’s convert_from_path() function:
from pdf2image import convert_from_path

def crop_pdf_pdf2image(input_path, output_path, crop_box):
    # Render every page to a PIL image (pdf2image defaults to 200 DPI)
    images = convert_from_path(input_path)
    cropped_images = []
    for image in images:
        # PIL expects (left, upper, right, lower) in pixels, origin at top-left
        cropped_image = image.crop(crop_box)
        cropped_images.append(cropped_image)
    # Save the first image as a PDF and append the rest as extra pages
    cropped_images[0].save(output_path, save_all=True, append_images=cropped_images[1:])

# Usage
crop_pdf_pdf2image("./image_based.pdf", "./img2pdf_crop.pdf", (30, 30, 570, 840))
print("Cropped Successfully!")
Output before cropping
Output after cropping
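Because this method crops rendered images, the crop box is read as pixels from the top-left corner, not as PDF points from the bottom-left. If you want to reuse a point-based box from the earlier methods, it has to be scaled by the rendering DPI and flipped vertically. Here is a minimal sketch of that conversion. `points_to_pixels` is a hypothetical helper, the 842-point page height is an assumed A4 value, and 200 is pdf2image's default DPI.

```python
def points_to_pixels(box, page_height_pt, dpi=200):
    """Convert a (left, bottom, right, top) box in PDF points into a
    (left, upper, right, lower) pixel box for PIL's Image.crop()."""
    scale = dpi / 72  # 72 points per inch
    left, bottom, right, top = box
    return (
        round(left * scale),
        round((page_height_pt - top) * scale),
        round(right * scale),
        round((page_height_pt - bottom) * scale),
    )


print(points_to_pixels((30, 30, 570, 840), page_height_pt=842))
# (83, 6, 1583, 2256)
```

The resulting tuple could be passed as crop_box to crop_pdf_pdf2image, provided the pages are rendered at the same DPI used for the conversion.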
Complexities
- Time Complexity: O(n * m), where n is the number of PDF pages and m is the pixel count of each page.
- Space Complexity: O(n * m), as it converts all pages to images in memory.
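To get a feel for what O(n * m) means in practice, here is a rough back-of-the-envelope estimate of the uncompressed size of rendered pages. It assumes A4 pages of 595 x 842 points, RGB images with 3 bytes per pixel, and pdf2image's default of 200 DPI. `estimate_raster_bytes` is a hypothetical helper, and real memory use will differ somewhat.

```python
def estimate_raster_bytes(width_pt, height_pt, dpi=200, channels=3):
    """Approximate bytes needed to hold one rendered page uncompressed."""
    width_px = round(width_pt * dpi / 72)
    height_px = round(height_pt * dpi / 72)
    return width_px * height_px * channels


per_page = estimate_raster_bytes(595, 842)  # assumed A4 page, RGB
print(f"{per_page / 1e6:.1f} MB per page")              # 11.6 MB per page
print(f"{100 * per_page / 1e9:.2f} GB for 100 pages")   # 1.16 GB for 100 pages
```

For long documents, convert_from_path() also accepts first_page and last_page arguments, so rendering the file in smaller page ranges is one way to keep memory bounded.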
Pros
- It crops well when the PDF contains images and graphics.
- It allows pixel-accurate cropping, which helps with complex PDFs.
Cons
- It uses more memory and processing time, since every page is rendered to pixels and then converted back into a PDF.
- The output is rasterized, so text-heavy PDFs can lose quality and their text is no longer selectable.
Specific usage
- It is a good choice when precise cropping is required, because it works at the pixel level.
Method 4: Using pdfrw
pdfrw is another pure Python PDF library that crops PDFs by modifying the MediaBox of each page. It is lightweight, simple, and good for basic PDF operations.
You can install pdfrw using the command below:
pip install pdfrw
The cropping function then looks like this:
from pdfrw import PdfReader, PdfWriter

def crop_pdf_pdfrw(input_path, output_path, crop_box):
    reader = PdfReader(input_path)
    writer = PdfWriter()
    for page in reader.pages:
        # Overwrite the page's MediaBox with the new rectangle
        page.MediaBox = [float(x) for x in crop_box]
        writer.addpage(page)
    writer.write(output_path)

# Usage
input_pdf = "./simple.pdf"
output_pdf = "./cropped_pdfrw.pdf"
crop_box = (30, 30, 570, 840)  # (left, bottom, right, top)
crop_pdf_pdfrw(input_pdf, output_pdf, crop_box)
print("Cropped Successfully!")
Output before cropping
Output after cropping
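Since this approach simply overwrites the MediaBox with whatever it is given, a box larger than the original page would enlarge the page instead of cropping it. A small guard is to intersect the requested box with the page's existing box first. The sketch below shows the idea. `clamp_to_page` is a hypothetical helper, and the US Letter size (612 x 792 points) is an assumed example. The values are passed through float() because a PDF parser may hand them back as strings.

```python
def clamp_to_page(crop_box, media_box):
    """Intersect crop_box with media_box; both are (left, bottom, right, top)."""
    c = [float(v) for v in crop_box]
    m = [float(v) for v in media_box]
    clamped = (max(c[0], m[0]), max(c[1], m[1]), min(c[2], m[2]), min(c[3], m[3]))
    if clamped[0] >= clamped[2] or clamped[1] >= clamped[3]:
        raise ValueError("crop box does not overlap the page")
    return clamped


# Assumed US Letter page, with values given as strings
print(clamp_to_page((30, 30, 570, 840), ["0", "0", "612", "792"]))
# (30.0, 30.0, 570.0, 792.0)
```

Inside crop_pdf_pdfrw, the clamped result could be assigned to page.MediaBox in place of the raw crop_box.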
Complexities
- Time Complexity: O(n), where n is the number of pages.
- Space Complexity: O(n), as it loads the entire PDF into memory.
Pros
- It is well suited to basic PDF operations, especially when a pure Python solution is required.
Cons
- It may not handle all PDF versions or complex structures.
Conclusion
After explaining four approaches, here is my final verdict:
- For simple, pure Python solutions, you can use PyPDF2 or pdfrw.
- If you need high performance or have large PDFs to process, use PyMuPDF.
- If your PDF contains complex layouts, or when pixel-level precision is crucial, use the pdf2image library.