If you are building a large database of information for AI applications, you often need to extract key pieces of information from PDF files. Doing so helps you build searchable databases or indexes, making it easier to find specific information across multiple documents. You can then run analyses and generate reports from the extracted data.
The data in a PDF file can be anything from simple text to tables, images, or other unstructured content. In this article, we will extract data from each of these formats and print it in a readable form in the console.
Here are three ways to extract text from a PDF File in Python:
- Using the “PyMuPDF” library (for simple text or complex formatted text, including tables)
- Using the “tika” library
- Using pdf2image + pytesseract (OCR) (for PDFs containing images)
Here are the three different PDFs we will use for this practical example:
- simple.pdf (contains a simple text paragraph)
- tabular.pdf (contains text in tabular format)
- image_based.pdf (contains text embedded in images)
In my current project directory, there is a folder called “pdfs” that contains the three PDFs mentioned above.
Method 1: Using the “PyMuPDF” library
PyMuPDF, also known as “fitz”, is a popular, high-performance third-party library used to extract data from PDF files. It can handle text extraction, image extraction, searching, and more.
You can install the library using the command below:
pip install pymupdf
Extracting simple text
Here is the screenshot of the “./pdfs/simple.pdf” file:
To extract the above paragraph from a PDF, you can create a custom function that opens the file in a “with fitz.open()” statement and reads the content of each page using the “.get_text()” method.
Keep in mind that “pymupdf” can be imported as “fitz” in a Python program:
import fitz  # PyMuPDF

def extract_text_pymupdf(pdf_path):
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text()
    return text

pdf_file = "./pdfs/simple.pdf"
print(extract_text_pymupdf(pdf_file))
Output
Sample PDF with Paragraph This is a sample PDF file that contains a simple text paragraph. It serves as an example of how to generate a PDF document with text using Python. PDFs are widely used for sharing documents in a format that is independent of software, hardware, and operating systems.
And we get all the content from that PDF without losing any information.
Extracting text in tabular format
How about text in tabular format? A lot of important information is laid out in tables. Here is the screenshot of the “tabular.pdf” file’s content:
So, how do you read this file and extract the content? Here is the code for that as well.
import fitz  # PyMuPDF

def extract_formatted_text(pdf_path):
    with fitz.open(pdf_path) as doc:
        for page in doc:
            blocks = page.get_text("blocks")
            for block in blocks:
                print(f"Block: {block[4]}")
                print(f"Bounding Box: {block[:4]}")
                print("---")

pdf_file = "./pdfs/tabular.pdf"
# The function prints as it goes, so call it directly rather than wrapping it in print()
extract_formatted_text(pdf_file)
Output
You can see from the output that each row of the table comes back as a separate text block, along with its bounding box.
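If you want the rows as structured data instead of printed blocks, you can split each block’s text on newlines. Here is a minimal sketch building on the code above; how cells map to lines depends on how the PDF was generated, so treat this as a starting point rather than a general table parser:

import fitz  # PyMuPDF

def extract_table_rows(pdf_path):
    rows = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Each block is (x0, y0, x1, y1, text, block_no, block_type)
            for block in page.get_text("blocks"):
                # Split the block's text into individual lines (cells)
                rows.append(block[4].strip().split("\n"))
    return rows

print(extract_table_rows("./pdfs/tabular.pdf"))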
Extracting images from PDF
A standout benefit of the pymupdf library is that you can extract images from a PDF, whether it contains one image or many. You can pull the images out and save them in your current project folder.
Here is the screenshot of the “image_based.pdf” file that contains images:
The above PDF file contains eight images, and we will extract all of them and save them in the current working directory using the code below:
import fitz  # PyMuPDF

# Extract images
def extract_images(pdf_path):
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            image_list = page.get_images()
            for img_index, img in enumerate(image_list):
                xref = img[0]
                base_image = doc.extract_image(xref)
                image_bytes = base_image["image"]
                ext = base_image["ext"]  # native format of the embedded image, e.g. "png" or "jpeg"
                # Save the image
                with open(f"image_page{i+1}_{img_index+1}.{ext}", "wb") as image_file:
                    image_file.write(image_bytes)

pdf_file = "./pdfs/image_based.pdf"
extract_images(pdf_file)
print("Extracted all eight images! Check your current project folder")
Output
Extracted all eight images! Check your current project folder
You can see all the images pulled from the PDF.
Time and Space Complexities
- Time Complexity: O(n), where n is the number of pages in the PDF; more pages take longer to process. Runtime also depends on the size and complexity of each page, because not all pages are created equal.
- Space Complexity: O(m), where m is the total content (text and images) of the PDF. More images to extract means more memory, plus any additional overhead from the extraction process.
Pros
- This library is fast and outperforms many other PDF libraries in execution speed.
- It also provides image extraction and text search (see the search sketch after this list).
- It supports various other formats such as .epub, .xps, and more.
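To illustrate the search capability, here is a minimal sketch. The keyword “sample” and the file path are just placeholders; “search_for()” returns one rectangle per match, which you could use for highlighting or redaction:

import fitz  # PyMuPDF

def search_keyword(pdf_path, keyword):
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            # search_for() returns a list of Rect objects, one per match
            for rect in page.search_for(keyword):
                print(f"Found '{keyword}' on page {page_number} at {rect}")

search_keyword("./pdfs/simple.pdf", "sample")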
Cons
- You need to install it separately as it is a third-party library.
- If you are a beginner in Python, its API can involve a learning curve and take some time to grasp.
Specific usage
- It is suitable for large PDF file processing because it is fast.
- If your project requires the extraction of text and images, then I would highly recommend you use this approach.
Method 2: Using “tika” library
Apache Tika is another well-known library that can detect and extract content from not only PDF files but also many other file formats. Its Python port provides a “parser.from_file()” function that returns the raw content and metadata of the file.
You can install the “tika” library using the command below:
pip install tika
You can use it in a code like this:
from tika import parser

def extract_text_tika(pdf_path):
    raw = parser.from_file(pdf_path)
    return raw['content']

pdf_file = "./pdfs/tabular.pdf"
print(extract_text_tika(pdf_file))
Output
Time and Space Complexities
- Time Complexity: O(n), where n is the file size. If the file is small, the response will be faster.
- Space Complexity: O(m), where m is the total content and metadata of the file. If the size is large, it requires more memory.
Pros
- It performs well on documents with complex layouts and rich content.
- It can also extract metadata from a document (see the sketch after this list).
- You can run Tika as a server to improve performance in high-traffic scenarios.
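As a quick illustration of the metadata point, here is a minimal sketch. “parser.from_file()” returns a dictionary with both “content” and “metadata” keys; the exact metadata fields you get back depend on the file:

from tika import parser

raw = parser.from_file("./pdfs/tabular.pdf")
# 'metadata' is a dictionary of fields such as content type and page count
for key, value in raw["metadata"].items():
    print(f"{key}: {value}")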
Cons
- If you are using this library for a smaller task, it can be overkill.
- Apache Tika requires a Java Runtime Environment (JRE), which might be a constraint in some environments.
- Parsing a very large document can be a memory-intensive operation.
Specific usage
- If you are working with multiple file formats including PDF and others, this is the best choice for you.
- If your PDF files are in multiple languages like Spanish, German, and English, then it will help you extract the text properly.
Method 3: Using pdf2image + pytesseract (OCR)
In real life, you come across not only text-based or image-based PDFs but also scanned PDFs: documents created by scanning or photographing written pages, where the text you want to extract lives inside the scanned images.
For these operations, we need the pdf2image and pytesseract libraries. You must also have Poppler and Tesseract installed on your system. Since I am using macOS, I can use Homebrew to install both packages with the commands below:
brew install poppler
brew install tesseract
If you are using Windows or Linux, do a quick search and install the above-mentioned packages as appropriate for your system.
After that, we can install the following Python-related third-party libraries using the command below:
pip install pdf2image pytesseract
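One setup note: if the Tesseract binary is not on your PATH (a common situation on Windows), you can point pytesseract to it explicitly. The path below is just an example of a default Windows install location; adjust it for your machine:

import pytesseract

# Example path for a default Windows install; adjust for your system
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"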
For this approach, I will be using “image_based.pdf” because this method is specifically designed to extract text from the images inside a PDF.
from pdf2image import convert_from_path
import pytesseract

def extract_text_ocr(pdf_path):
    images = convert_from_path(pdf_path)
    text = ""
    for image in images:
        text += pytesseract.image_to_string(image)
    return text

pdf_file = "./pdfs/image_based.pdf"
print(extract_text_ocr(pdf_file))
Output
Time and Space Complexities
- Time Complexity: O(n * p), where n is the number of pages and p is the number of pixels per page. Since we are using OCR, processing takes considerably longer than the other methods.
- Space Complexity: O(n * p + m), where m is the total text content. It also requires a lot of memory, because every page is rendered as a full image before recognition.
Pros
- It works well if your PDF file is created based on images or screenshots.
Cons
- It is by far the slowest method.
- It requires additional software (Tesseract OCR engine) to be installed on your machine.
- It can make character-recognition errors, especially on low-quality scans.
Specific usage
- If you come across a badly scanned or low-quality document whose letters you cannot read clearly, use this approach; it is specialized for exactly this scenario. A tuning sketch follows below.
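If you do hit recognition errors on a low-quality scan, two knobs that often help are rendering pages at a higher DPI and passing a Tesseract configuration string. The dpi value and --psm mode below are starting points to experiment with, not universal settings:

from pdf2image import convert_from_path
import pytesseract

def extract_text_ocr_tuned(pdf_path):
    # Render at 300 DPI for sharper images (the pdf2image default is 200)
    images = convert_from_path(pdf_path, dpi=300)
    text = ""
    for image in images:
        # --psm 6 tells Tesseract to assume a single uniform block of text
        text += pytesseract.image_to_string(image, config="--psm 6")
    return text

print(extract_text_ocr_tuned("./pdfs/image_based.pdf"))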
That’s all!