Whether you want to create a shorter PDF from an original or remove unnecessary content, deleting pages is the way to do it. It also reduces the file size, which makes the document lighter to store and process.
Here are three ways to delete pages from a PDF using Python:
- Using PyMuPDF
- Using PyPDF2
- Using pdfrw
For this demonstration, we will use a five-page PDF and remove some of its pages with each method.
Here is the file: sample_5_pages.
Method 1: Using PyMuPDF
If you are looking for a memory-efficient solution among the many PDF libraries, I highly recommend PyMuPDF. Its .delete_page() method takes a zero-based page index and removes that page from the document.
You can install it using pip:
pip install pymupdf
You can import it under the name “fitz”:
import fitz
Here is the complete code:
import fitz  # PyMuPDF


def delete_pages(input_path, output_path, pages_to_delete):
    doc = fitz.open(input_path)
    # Delete from the highest page number down so earlier indexes don't shift.
    # sorted(set(...)) leaves the caller's list untouched and drops duplicates.
    for page_num in sorted(set(pages_to_delete), reverse=True):
        doc.delete_page(page_num - 1)  # delete_page() expects a 0-based index
    doc.save(output_path)
    doc.close()


# Usage
input_path = 'sample_5_pages.pdf'
output_path = 'reduced.pdf'
pages_to_delete = [1, 3, 5]  # Page numbers to delete (1-indexed)
delete_pages(input_path, output_path, pages_to_delete)
print("Pages 1, 3, and 5 deleted successfully")
Output
As illustrated in the screenshot above, our output PDF has only 2 pages, page numbers 2 and 4. Page numbers 1, 3, and 5 have been deleted successfully.
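The reverse sort in the code above matters because every deletion shifts the pages that follow it. Here is a minimal sketch of that effect, using a plain Python list as a stand-in for the PDF; delete_by_number is a hypothetical helper written for this illustration and is not part of PyMuPDF.

```python
def delete_by_number(pages, numbers, reverse):
    """Delete 1-indexed page numbers from a list that stands in for a PDF."""
    remaining = list(pages)  # copy so the caller's list is untouched
    for number in sorted(set(numbers), reverse=reverse):
        index = number - 1  # convert to a 0-based index
        if index < len(remaining):
            del remaining[index]
    return remaining


pages = ["p1", "p2", "p3", "p4", "p5"]

# Highest page first: earlier indexes are unaffected by each deletion.
print(delete_by_number(pages, [1, 3, 5], reverse=True))   # ['p2', 'p4']

# Lowest page first: each deletion shifts the later pages,
# so the wrong pages end up being removed.
print(delete_by_number(pages, [1, 3, 5], reverse=False))  # ['p2', 'p3', 'p5']
```

Deleting in ascending order removes page 1, then what used to be page 4, and never reaches a fifth page, which is why the page numbers are processed from highest to lowest.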
Complexities
- Time Complexity: Roughly O(n), where n is the number of pages to delete, because the loop makes one .delete_page() call per page.
- Space Complexity: O(1) additional space, because pages are removed from the open document in place instead of being copied into a second document.
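If most pages are being removed, it can be simpler to tell PyMuPDF which pages to keep. The sketch below is one possible variant, not the only way to do it: indexes_to_keep and keep_remaining_pages are hypothetical helper names, while Document.select() and the garbage and deflate options of save() come from PyMuPDF. The save options are included because, as far as I know, removing pages can leave unreferenced objects in the file until they are garbage-collected on save.

```python
def indexes_to_keep(page_count, pages_to_delete):
    """Return the 0-based indexes of the pages that should survive."""
    unwanted = set(pages_to_delete)  # 1-indexed page numbers
    return [i for i in range(page_count) if i + 1 not in unwanted]


def keep_remaining_pages(input_path, output_path, pages_to_delete):
    # PyMuPDF is imported here so indexes_to_keep() works without it.
    import fitz

    doc = fitz.open(input_path)
    try:
        keep = indexes_to_keep(doc.page_count, pages_to_delete)
        if not keep:
            raise ValueError("At least one page must remain in the PDF")
        doc.select(keep)  # retain only the listed pages
        # garbage=4 and deflate=True ask PyMuPDF to drop unreferenced
        # objects and compress streams when writing the file.
        doc.save(output_path, garbage=4, deflate=True)
    finally:
        doc.close()


print(indexes_to_keep(5, [1, 3, 5]))  # [1, 3]
```

Calling keep_remaining_pages('sample_5_pages.pdf', 'reduced.pdf', [1, 3, 5]) should leave pages 2 and 4, the same result as the deletion loop.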
Pros
- It is fast and memory-efficient.
- It handles large PDF files without any special treatment.
- Besides deleting pages, it can also extract text and perform many other PDF operations.
Cons
- PyMuPDF is a third-party library, so it is an external dependency.
- If you only need simple operations, such a feature-rich dependency can be overkill.
Method 2: Using PyPDF2
PyPDF2 is one of the most popular libraries for simple PDF operations. If you need to remove specific pages rather than a range, this is the approach to go for. Note that PyPDF2 is no longer developed under that name (the project continues as pypdf), so treat the code below as written for the final PyPDF2 releases.
You can install it using pip:
pip install PyPDF2
Here is the complete code:
import PyPDF2


# Custom function to delete pages
def delete_pages(input_path, output_path, pages_to_delete):
    with open(input_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        pdf_writer = PyPDF2.PdfWriter()
        # Copy every page that is not marked for deletion
        for page_num in range(len(pdf_reader.pages)):
            if page_num + 1 not in pages_to_delete:
                page = pdf_reader.pages[page_num]
                pdf_writer.add_page(page)
        with open(output_path, 'wb') as output_file:
            pdf_writer.write(output_file)


# Calling the custom function
input_path = 'sample_5_pages.pdf'
output_path = 'reduced.pdf'
pages_to_delete = [1, 2, 3, 5]  # Page numbers to delete (1-indexed)
delete_pages(input_path, output_path, pages_to_delete)
print("Pages number 1, 2, 3, and 5 have been deleted")
Output
From the image above, it’s clear that we removed four pages from the PDF and only one page remains: page 4.
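This method takes individual page numbers, but sometimes it is handier to describe them as a mix of single pages and ranges. A small sketch of how such a description could be expanded into the pages_to_delete list follows; expand_page_spec and its "1,3-5" string format are assumptions made for this example and are not part of PyPDF2.

```python
def expand_page_spec(spec):
    """Turn a string such as "1,3-5" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # ignore empty pieces such as a trailing comma
        if "-" in part:
            start, end = (int(piece) for piece in part.split("-", 1))
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_page_spec("1, 3-5"))  # [1, 3, 4, 5]
```

The returned list can be passed straight to delete_pages() as pages_to_delete.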
Complexities
- Time Complexity: O(n), where n is the number of pages in the PDF.
- Space Complexity: O(n), because it needs to store the entire PDF in memory while processing.
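The O(n) time figure assumes that checking whether a page is marked for deletion takes constant time. With a list, the check `page_num + 1 not in pages_to_delete` scans the list on every iteration, so converting it to a set first keeps the loop linear. The sketch below shows the idea on stand-in data; filter_pages is a hypothetical helper, and the strings take the place of real page objects.

```python
def filter_pages(pages, pages_to_delete):
    """Yield the pages whose 1-indexed position is not marked for deletion."""
    unwanted = set(pages_to_delete)  # average-case O(1) membership tests
    for number, page in enumerate(pages, start=1):
        if number not in unwanted:
            yield page


# Strings stand in for page objects here.
sample_pages = ["p1", "p2", "p3", "p4", "p5"]
print(list(filter_pages(sample_pages, [1, 2, 3, 5])))  # ['p4']
```

For a five-page file the difference is negligible; it only starts to matter when both the document and the deletion list are large.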
Pros
- It works well with small-to-medium-sized PDFs.
- It provides a simple API to work with.
Cons
- It is not as memory efficient as PyMuPDF.
- It loads an entire file into memory, so it becomes slow for large PDF files.
Method 3: Using pdfrw
pdfrw is a third-party PDF library with a specific use case: it is a good fit when you want to preserve the original PDF structure while modifying the file.
You can install the pdfrw library using the command below:
pip install pdfrw
Here is the complete Python code:
from pdfrw import PdfReader, PdfWriter


def delete_pages(input_path, output_path, pages_to_delete):
    reader = PdfReader(input_path)
    writer = PdfWriter()
    # enumerate(..., 1) yields 1-indexed page numbers
    for page_num, page in enumerate(reader.pages, 1):
        if page_num not in pages_to_delete:
            writer.addpage(page)
    writer.write(output_path)


# Calling the custom function
input_path = 'sample_5_pages.pdf'
output_path = 'reduced.pdf'
pages_to_delete = [3, 4, 5]  # Page numbers to delete (1-indexed)
delete_pages(input_path, output_path, pages_to_delete)
print("Pages number 3, 4, and 5 have been deleted")
Output
As the screenshot above shows, the new PDF no longer contains the last three pages.
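One thing to watch with a filter loop like the one above: a page number that does not exist in the PDF is silently ignored, so a typo such as 7 for a five-page file goes unnoticed. A minimal sketch of a check you could run before deleting is shown here; validate_pages is a hypothetical helper, not part of pdfrw.

```python
def validate_pages(pages_to_delete, page_count):
    """Return a sorted, de-duplicated list of valid 1-indexed page numbers."""
    unique = sorted(set(pages_to_delete))
    invalid = [p for p in unique if not 1 <= p <= page_count]
    if invalid:
        raise ValueError(
            f"Page numbers outside 1-{page_count}: {invalid}"
        )
    return unique


print(validate_pages([5, 3, 3, 4], page_count=5))  # [3, 4, 5]

try:
    validate_pages([3, 7], page_count=5)
except ValueError as error:
    print(error)  # Page numbers outside 1-5: [7]
```

In the function above, len(reader.pages) would supply the page_count argument.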
Complexities
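- Time Complexity: O(n), where n is the number of pages in the PDF, because the loop visits every page once.
- Space Complexity: O(n), because the reader and writer hold the page objects in memory while processing.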
Pros
- Faster than PyPDF2
Cons
- It is neither as memory-efficient nor as feature-rich as PyMuPDF.
- It can struggle if the PDF has a complex structure.
Final analysis
The right library depends on your requirements.
- If performance and speed are priorities, use PyMuPDF.
- For small to medium PDFs with simple structures, use PyPDF2 or pdfrw.
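Since speed depends heavily on the PDF itself, it is worth timing the candidates on your own files before choosing. The helper below is a rough sketch using only the standard library; time_call is a hypothetical name, and the call to sum() merely stands in for one of the delete_pages() functions from this article.

```python
import time


def time_call(func, *args, repeats=3):
    """Return (best_seconds, last_result) over a few runs of func(*args)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


# sum() is only a placeholder workload for the demonstration.
seconds, total = time_call(sum, range(1_000_000))
print(f"Best of 3 runs: {seconds:.4f}s")
```

To compare the libraries, you could call time_call(delete_pages, input_path, output_path, pages_to_delete) once for each implementation.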