How to Extract Text from a PDF File in Python

  • 23 Sep, 2024

Here are three ways to extract text from a PDF File in Python:

  1. Using the “PyMuPDF” library (For simple text or complex formatted text, including tables)
  2. Using the “tika” library
  3. Using pdf2image + pytesseract (OCR) (For scanned PDFs whose text exists only as images)

Here are the three different PDFs we will use for this practical example:

  1. simple.pdf (Contains a simple textual paragraph)
  2. tabular.pdf (Contains text in the tabular format)
  3. image_based.pdf (Contains text in the format of images)

Three sample pdfs for the practical

In my current project directory, there is a folder called “pdfs,” and inside that folder, I have the three aforementioned PDFs.

Method 1: Using the “PyMuPDF” library

PyMuPDF, historically imported as “fitz”, is a popular, high-performance third-party library for extracting data from PDF files. It handles text extraction, image extraction, text search, and more.

You can install the library using the command below:

pip install pymupdf

Installing pymupdf

Extracting simple text

Here is the screenshot of the “./pdfs/simple.pdf” file:

Simple PDF Screenshot

To extract the paragraph above, write a small function that opens the file with a “with fitz.open()” statement and reads the content of each page with the “.get_text()” method.

Keep in mind that the package is installed as “pymupdf” but imported as “fitz” in a Python program (recent releases also accept “import pymupdf”):

import fitz  # PyMuPDF


def extract_text_pymupdf(pdf_path):
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text()
    return text


pdf_file = "./pdfs/simple.pdf"
print(extract_text_pymupdf(pdf_file))

Output

Sample PDF with Paragraph

This is a sample PDF file that contains a simple text paragraph. It serves as an example of how to
generate a PDF document with text using Python. PDFs are widely used for sharing documents in a
format that is independent of software, hardware, and operating systems.

We get all the content from that PDF without losing any information.
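
When page boundaries matter (for example, to note which page a sentence came from), it can help to keep the text per page instead of building one long string. The sketch below is an illustration rather than part of the original example: “collect_page_texts” and “format_pages” are hypothetical helper names, and the only assumption is that each page object exposes a “.get_text()” method, which is the case for the pages of a document opened with “fitz.open()”.

```python
def collect_page_texts(pages):
    """Return a list of (page_number, text) pairs, numbering pages from 1.

    `pages` can be any iterable of objects with a get_text() method,
    for example a document opened with fitz.open().
    """
    return [(number, page.get_text()) for number, page in enumerate(pages, start=1)]


def format_pages(page_texts):
    """Join per-page text into one string with a header line before each page."""
    parts = []
    for number, text in page_texts:
        parts.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(parts)


# Typical use with PyMuPDF (assumes ./pdfs/simple.pdf exists):
#   import fitz
#   with fitz.open("./pdfs/simple.pdf") as doc:
#       print(format_pages(collect_page_texts(doc)))
```

Because the helpers only rely on “.get_text()”, they can be tried out with simple stand-in objects before pointing them at a real PDF.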

Extracting text in tabular format

What about text in a tabular format? A lot of important information is laid out in tables. Here is the screenshot of the “tabular.pdf” file’s content:

Screenshot of Tabular PDF

How do you read this file and extract its content? Passing "blocks" to the “.get_text()” method returns each block of text together with its bounding box:

import fitz  # PyMuPDF


def extract_formatted_text(pdf_path):
    with fitz.open(pdf_path) as doc:
        for page in doc:
            blocks = page.get_text("blocks")
            for block in blocks:
                print(f"Block: {block[4]}")
                print(f"Bounding Box: {block[:4]}")
                print("---")


pdf_file = "./pdfs/tabular.pdf"
# The function prints as it goes and returns nothing, so call it directly
extract_formatted_text(pdf_file)

Output

Block-by-Block output from tabular format

In the output above, each row of the table comes back as its own block, along with its bounding box coordinates.
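
If you would rather work with the rows as Python lists than print them, the block tuples can be post-processed. The following is a minimal sketch under stated assumptions, not a general table parser: “blocks_to_rows” is a hypothetical helper, it expects tuples shaped like the items returned by page.get_text("blocks"), namely (x0, y0, x1, y1, text, block_no, block_type), and it assumes that every cell of a row sits on its own line inside the block text. Tables with merged or multi-line cells will likely need a different approach.

```python
def blocks_to_rows(blocks):
    """Turn PyMuPDF text blocks into a list of rows, each a list of cell strings.

    Each block is expected to be a tuple shaped like the items returned by
    page.get_text("blocks"): (x0, y0, x1, y1, text, block_no, block_type).
    """
    # Keep only text blocks (block_type 0); image blocks have block_type 1.
    text_blocks = [b for b in blocks if b[6] == 0]
    # Sort top-to-bottom, then left-to-right, using the top-left corner.
    text_blocks.sort(key=lambda b: (b[1], b[0]))
    rows = []
    for block in text_blocks:
        # Assumption: every cell of a row sits on its own line inside the block.
        cells = [line.strip() for line in block[4].splitlines() if line.strip()]
        if cells:
            rows.append(cells)
    return rows


# Typical use with PyMuPDF (assumes ./pdfs/tabular.pdf exists):
#   import fitz
#   with fitz.open("./pdfs/tabular.pdf") as doc:
#       for page in doc:
#           for row in blocks_to_rows(page.get_text("blocks")):
#               print(row)
```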

Extracting images from PDF

A standout benefit of the PyMuPDF library is that it can also extract the images embedded in a PDF, whether the file contains a single image or many.

You can pull the images out and save them in your current project folder.

Here is the screenshot of the “image_based.pdf” file that contains images:

Screenshot of image based PDF

The above PDF file contains 8 images, and we will extract all of them and save them in the current working directory using the code below:

import fitz  # PyMuPDF


# Extract images
def extract_images(pdf_path):
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            image_list = page.get_images()
            for img_index, img in enumerate(image_list):
                xref = img[0]
                base_image = doc.extract_image(xref)
                image_bytes = base_image["image"]
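                # Save the image with the extension PyMuPDF reports
                # instead of assuming every image is a PNG
                ext = base_image["ext"]
                with open(f"image_page{i+1}_{img_index+1}.{ext}", "wb") as image_file:
                    image_file.write(image_bytes)


pdf_file = "./pdfs/image_based.pdf"
extract_images(pdf_file)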
print("Extracted all eight images! Check your current project folder")

Output

Extracted all eight images! Check your current project folder

Extracted images

You can see all the images pulled from the PDF.
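
One thing to watch for: the same embedded image (a logo, for instance) can be referenced from several pages, in which case a page-by-page loop saves it more than once. The sketch below shows one possible way to avoid that. “unique_xrefs” and “image_filename” are hypothetical helpers; they rely only on the first item of each page.get_images() tuple being the image's xref and on doc.extract_image() reporting the file type under the “ext” key.

```python
def unique_xrefs(image_lists):
    """Return image xrefs in first-seen order, skipping repeats.

    `image_lists` is an iterable with one list per page, each shaped like
    the result of page.get_images(): tuples whose first item is the xref.
    """
    seen = set()
    ordered = []
    for image_list in image_lists:
        for img in image_list:
            xref = img[0]
            if xref not in seen:
                seen.add(xref)
                ordered.append(xref)
    return ordered


def image_filename(xref, ext):
    """Build a file name from the xref and the extension PyMuPDF reports."""
    return f"image_xref{xref}.{ext}"


# Typical use with PyMuPDF (assumes ./pdfs/image_based.pdf exists):
#   import fitz
#   with fitz.open("./pdfs/image_based.pdf") as doc:
#       for xref in unique_xrefs(page.get_images() for page in doc):
#           info = doc.extract_image(xref)
#           with open(image_filename(xref, info["ext"]), "wb") as f:
#               f.write(info["image"])
```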

Method 2: Using the “tika” library

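Apache Tika is another well-known toolkit that detects and extracts content not only from PDF files but also from many other file formats. The “tika” Python package wraps it and provides a “parser.from_file()” function that returns a dictionary; the raw text of the file is stored under its “content” key. Note that the package runs a Tika server in the background, so a Java runtime must be installed on your system.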

You can install the “tika” library using the command below:

pip install tika

You can use it in code like this:

from tika import parser


def extract_text_tika(pdf_path):
    raw = parser.from_file(pdf_path)
    return raw['content']


pdf_file = "./pdfs/tabular.pdf"

print(extract_text_tika(pdf_file))

Output

Output of using "tika" library
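
The “content” value is not always ready to print: it can be None when no text could be extracted, and it often carries runs of blank lines. A small clean-up step, sketched below, handles both cases. “clean_tika_content” is a hypothetical helper, and it assumes only that its argument is a dictionary shaped like the one “parser.from_file()” returns.

```python
def clean_tika_content(parsed):
    """Return tidy text from a dict shaped like tika's parser.from_file() result."""
    # "content" may be missing or None when no text could be extracted.
    content = parsed.get("content") or ""
    # Strip each line and drop the empty ones to remove blank-line padding.
    lines = [line.strip() for line in content.splitlines()]
    return "\n".join(line for line in lines if line)


# Typical use (assumes tika and Java are installed and the file exists):
#   from tika import parser
#   print(clean_tika_content(parser.from_file("./pdfs/tabular.pdf")))
```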

Method 3: Using pdf2image + pytesseract (OCR)

In practice, you will come across not only text-based and image-rich PDFs but also scanned PDFs.

A scanned PDF is created by scanning or photographing printed or handwritten pages and saving the resulting images as a PDF. Such a file has no text layer, so the text has to be recognized from the page images with OCR (optical character recognition).

For this, we need the pdf2image and pytesseract libraries. Both are wrappers around external tools, so Poppler and Tesseract must also be installed on your system.

Since I am using macOS, I can use Homebrew to install both of these packages using the command below:

brew install poppler

brew install tesseract

If you are using Windows or Linux, install Poppler and Tesseract through your system's package manager or the projects' installers. On Debian or Ubuntu, for example, the packages are typically named “poppler-utils” and “tesseract-ocr”.

After that, install the two Python libraries using the command below:

pip install pdf2image pytesseract
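
Before running any OCR code, it can save some debugging time to confirm that the system tools are actually reachable. The check below is a minimal sketch using only the standard library. “missing_tools” is a hypothetical helper, and the default names are assumptions: “pdftoppm” is the Poppler command pdf2image calls by default, and “tesseract” is the command pytesseract looks for.

```python
import shutil


def missing_tools(tools=("pdftoppm", "tesseract")):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Poppler and Tesseract look available.")
```

If a tool is installed somewhere that is not on the PATH (common on Windows), this check reports it as missing even though it exists, so treat the result as a hint rather than proof.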

For this approach, I will be using “image_based.pdf” because its text exists only inside images, which is exactly the case OCR is meant for.

from pdf2image import convert_from_path
import pytesseract


def extract_text_ocr(pdf_path):
    images = convert_from_path(pdf_path)
    text = ""
    for image in images:
        text += pytesseract.image_to_string(image)
    return text


pdf_file = "./pdfs/image_based.pdf"

print(extract_text_ocr(pdf_file))

Output

Output of combining pdf2image and pytesseract
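
OCR is much slower than reading a text layer, so one reasonable pattern is to try direct extraction first and fall back to OCR only when it yields almost nothing. The sketch below illustrates that idea. “extract_with_fallback” is a hypothetical helper, the 20-character threshold is an arbitrary value chosen for illustration, and the two extractor arguments can be any functions that take a path and return text, such as “extract_text_pymupdf” and “extract_text_ocr” from this article.

```python
def extract_with_fallback(pdf_path, direct_extractor, ocr_extractor, min_chars=20):
    """Try direct text extraction first and fall back to OCR for scanned PDFs.

    `min_chars` is an arbitrary threshold chosen for illustration.
    Returns a (method_name, text) tuple.
    """
    text = direct_extractor(pdf_path) or ""
    # A text layer with fewer than `min_chars` visible characters is treated as absent.
    if len(text.strip()) >= min_chars:
        return "direct", text
    return "ocr", ocr_extractor(pdf_path)


# Typical use with the functions defined earlier in this article:
#   method, text = extract_with_fallback(
#       "./pdfs/image_based.pdf", extract_text_pymupdf, extract_text_ocr
#   )
#   print(method)
#   print(text)
```

Passing the extractors in as arguments keeps the helper independent of any particular library, so it can be tested without a PDF at hand.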

That’s all!

Krunal Lathiya

With a career spanning over eight years in the field of Computer Science, Krunal’s expertise is rooted in a solid foundation of hands-on experience, complemented by a continuous pursuit of knowledge.