How to Extract Text from an XML File in Python

Here are four different ways to extract text from XML files in Python:

Using “xml.etree.ElementTree” built-in module
Using “lxml” (third-party library)
Using “xmltodict” (for converting XML text into a dictionary)
Using the “re” module (Regular Expressions)

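To extract the data, we need a well-formed “XML” file. For this project, we will use a “books.xml” file whose root element, bookstore, contains three book elements. Each book has a category attribute and four children: title (which carries a lang attribute), author, year, and price.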
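The file itself is not reproduced as text here, so the sketch below recreates it from the values printed in the outputs of the methods that follow. The element names, attributes, and values come from those outputs; the XML declaration, the indentation, and the helper name write_sample() are assumptions made for illustration.

```python
from pathlib import Path

# Reconstructed sample; the declaration and indentation are assumptions.
BOOKS_XML = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
    <price>10.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">A Brief History of Time</title>
    <author>Stephen Hawking</author>
    <year>1988</year>
    <price>14.99</price>
  </book>
  <book category="fiction">
    <title lang="es">Cien años de soledad</title>
    <author>Gabriel García Márquez</author>
    <year>1967</year>
    <price>12.99</price>
  </book>
</bookstore>
"""


def write_sample(path="books.xml"):
    """Write the sample XML to disk as UTF-8 and return the path."""
    target = Path(path)
    target.write_text(BOOKS_XML, encoding="utf-8")
    return target


if __name__ == "__main__":
    print(f"Wrote {write_sample()}")
```

Writing the file explicitly as UTF-8 matters because two of the values contain non-ASCII characters.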

Method 1: Using the “xml.etree.ElementTree” module

The “xml.etree.ElementTree” module is part of the standard library. Its parse() function reads an “XML” file into a tree, getroot() returns the root element, and iter() walks every element in document order so you can read each element's text attribute.

import xml.etree.ElementTree as ET

tree = ET.parse('books.xml')
root = tree.getroot()

for element in root.iter():
    print(element.text)

Output

The Great Gatsby
F. Scott Fitzgerald
1925
10.99


A Brief History of Time
Stephen Hawking
1988
14.99


Cien años de soledad
Gabriel García Márquez
1967
12.99

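The loop prints the text of every element, so we get just the data. The blank lines come from container elements (bookstore and book), whose text is only the indentation whitespace between tags.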
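To drop the blank lines that appear in the output above, one option is to strip each value and skip anything that ends up empty. This is a minimal sketch: SAMPLE is a trimmed, inline stand-in for books.xml, and extract_text() is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for books.xml so the example runs on its own.
SAMPLE = """<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
    <price>10.99</price>
  </book>
</bookstore>"""


def extract_text(root):
    """Return the stripped text of every element that has real content."""
    values = []
    for element in root.iter():
        text = (element.text or "").strip()  # .text is None for empty elements
        if text:
            values.append(text)
    return values


root = ET.fromstring(SAMPLE)
print(extract_text(root))

# Target one field with a path expression instead of walking everything.
print([title.text for title in root.findall("./book/title")])
```

The findall() call at the end is the usual choice when only one kind of element is needed.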

Method 2: Using “lxml”

If you are looking for a third-party solution, I recommend the “lxml” library. Its etree.parse() function accepts an XML file; call .getroot() to get the root element, then loop over .iter() and read each element's .text attribute to extract the content.

You can install the “lxml” library using the command below:

pip install lxml

Here is the code:

from lxml import etree

tree = etree.parse('books.xml')
root = tree.getroot()

for element in root.iter():
    print(element.text)

Output

The Great Gatsby
F. Scott Fitzgerald
1925
10.99


A Brief History of Time
Stephen Hawking
1988
14.99


Cien años de soledad
Gabriel García Márquez
1967
12.99
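When only some elements matter, iter() accepts a tag name, and get() reads an attribute. The sketch below uses an inline two-book sample (an assumption standing in for books.xml). Because these particular calls are shared by lxml and the standard library, the import falls back to ElementTree when lxml is not installed.

```python
try:
    from lxml import etree  # preferred when installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with the same calls

# Inline stand-in for books.xml, encoded to bytes so both parsers accept it.
SAMPLE = """<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
  </book>
  <book category="fiction">
    <title lang="es">Cien años de soledad</title>
    <author>Gabriel García Márquez</author>
  </book>
</bookstore>""".encode("utf-8")

root = etree.fromstring(SAMPLE)

# Passing a tag name to iter() skips every other element.
titles = [(el.get("lang"), el.text) for el in root.iter("title")]
for lang, text in titles:
    print(f"[{lang}] {text}")
```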

Method 3: Using “xmltodict”

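The “xmltodict” library is a third-party package (install it with pip install xmltodict) for cases where you want the extracted data as a Python dictionary. Its parse() function reads XML text and returns a dictionary in which attributes are prefixed with “@” and the text of an element that has attributes is stored under “#text”.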

import xmltodict

with open('books.xml', 'r', encoding='utf-8') as file:
    data = xmltodict.parse(file.read())
    print(data)

Output

{'bookstore': {'book': [{'@category': 'fiction', 'title': {'@lang': 'en', '#text': 'The Great Gatsby'}, 'author': 'F. Scott Fitzgerald', 'year': '1925', 'price': '10.99'}, {'@category': 'non-fiction', 'title': {'@lang': 'en', '#text': 'A Brief History of Time'}, 'author': 'Stephen Hawking', 'year': '1988', 'price': '14.99'}, {'@category': 'fiction', 'title': {'@lang': 'es', '#text': 'Cien años de soledad'}, 'author': 'Gabriel García Márquez', 'year': '1967', 'price': '12.99'}]}}
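The dictionary above can be navigated like any nested Python structure. The sketch below hard-codes a trimmed copy of that output (two of the three books) so it runs without xmltodict installed; summarize() is a hypothetical helper name. Note that all values are strings, and that an element with attributes, such as title, becomes a dictionary while a plain element, such as author, stays a string.

```python
# Trimmed copy of the dictionary printed above (two of the three books).
data = {
    "bookstore": {
        "book": [
            {
                "@category": "fiction",
                "title": {"@lang": "en", "#text": "The Great Gatsby"},
                "author": "F. Scott Fitzgerald",
                "year": "1925",
                "price": "10.99",
            },
            {
                "@category": "fiction",
                "title": {"@lang": "es", "#text": "Cien años de soledad"},
                "author": "Gabriel García Márquez",
                "year": "1967",
                "price": "12.99",
            },
        ]
    }
}


def summarize(parsed):
    """Return (title, author, price) tuples from an xmltodict-style dict."""
    books = parsed["bookstore"]["book"]
    if isinstance(books, dict):  # a single <book> comes back as a dict, not a list
        books = [books]
    rows = []
    for book in books:
        title = book["title"]
        # Elements with attributes become dicts; plain elements are strings.
        text = title["#text"] if isinstance(title, dict) else title
        rows.append((text, book["author"], float(book["price"])))
    return rows


for row in summarize(data):
    print(row)
```

The isinstance() checks guard against the two shape changes that most often surprise people: a lone child element and an element without attributes.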

Method 4: Using the “re” module (regex)

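Regular expressions can pull values out of XML text without building a tree, but they do not understand XML: nesting, comments, CDATA sections, and entities such as &amp; are not handled for you. You can use the re.findall() function to get the data you are looking for when the file's structure is simple and known in advance.

If you only need a few specific fields from a file with a predictable layout, the “regular expression” approach can work. Note that the parse_xml_with_regex() function in this section still reads the whole file into memory, and for anything irregular one of the parsers above is the safer choice.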
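For a single field, a compact version of the idea might look like the sketch below. It assumes a flat layout in which title elements never nest; SAMPLE is an inline stand-in for books.xml, and find_titles() is a hypothetical helper name.

```python
import html
import re

# Inline stand-in for books.xml.
SAMPLE = """<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
  </book>
  <book category="non-fiction">
    <title lang="en">A Brief History of Time</title>
  </book>
</bookstore>"""

# \b stops "<title" from matching longer tag names; [^>]* skips attributes.
TITLE_PATTERN = re.compile(r"<title\b[^>]*>(.*?)</title>", re.DOTALL)


def find_titles(xml_text):
    """Return the text of every <title> element found by the pattern."""
    # html.unescape decodes entities such as &amp; that a regex leaves untouched.
    return [html.unescape(match.strip()) for match in TITLE_PATTERN.findall(xml_text)]


print(find_titles(SAMPLE))
```

The word boundary is the important detail: without it, a pattern that starts with a tag name can also match any longer tag name that begins the same way.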

import re


def parse_xml_with_regex(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    # Capture the body of each <book ...> element; \b keeps the
    # pattern from also matching the opening <bookstore> tag
    book_pattern = r'<book\b[^>]*>(.*?)</book>'
    books_raw = re.findall(book_pattern, content, re.DOTALL)

    # Process each book
    books = []
    for book_content in books_raw:
        book = {}
        # Parse individual fields; [^>]* skips attributes inside the opening tag
        fields = re.findall(r'<(\w+)\b[^>]*>(.*?)</\1>', book_content, re.DOTALL)
        for tag, value in fields:
            book[tag] = value.strip()
        books.append(book)

    return books


# Use the function
books = parse_xml_with_regex('books.xml')

# Print all books
print(f"Total number of books: {len(books)}")
for i, book in enumerate(books, 1):
    print(f"\nBook {i}:")
    for key, value in book.items():
        print(f"  {key}: {value}")

print("\nAll books have been printed.")

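Output

Total number of books: 3

Book 1:
  title: The Great Gatsby
  author: F. Scott Fitzgerald
  year: 1925
  price: 10.99

Book 2:
  title: A Brief History of Time
  author: Stephen Hawking
  year: 1988
  price: 14.99

Book 3:
  title: Cien años de soledad
  author: Gabriel García Márquez
  year: 1967
  price: 12.99

All books have been printed.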

That’s all!

Krunal Lathiya
