How to Extract Text from an XML File in Python

Many traditional companies still use XML (Extensible Markup Language) in web services. You may need to integrate the content of XML files into your application or turn it into data points for analysis, and that is where data extraction from XML files comes into play.

Once you extract the text, you can clean it and use it as input for analysis, search indexing, or as training data for machine-learning models, including LLMs (Large Language Models).

Here are four different ways to extract text from XML files in Python:

Using “xml.etree.ElementTree” built-in module
Using “lxml” (third-party library)
Using “xmltodict” (for converting XML text into a dictionary)
Using the “re” module (Regular Expressions)

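To extract the data, we need a valid XML file. For this project, we will use a “books.xml” file. It has a “bookstore” root that holds three “book” elements. Each “book” has a “category” attribute and four children: “title” (with a “lang” attribute), “author”, “year”, and “price”.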
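If you do not have the file at hand, here is a minimal sketch that recreates a “books.xml” with the same structure. The content is reconstructed from the dictionary output shown under Method 3; the XML declaration and the indentation are assumptions, not copied from the original file.

```python
from pathlib import Path

# Reconstructed from the dictionary output under Method 3.
# The XML declaration and the indentation are assumptions.
BOOKS_XML = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
    <price>10.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">A Brief History of Time</title>
    <author>Stephen Hawking</author>
    <year>1988</year>
    <price>14.99</price>
  </book>
  <book category="fiction">
    <title lang="es">Cien años de soledad</title>
    <author>Gabriel García Márquez</author>
    <year>1967</year>
    <price>12.99</price>
  </book>
</bookstore>
"""

# Write as UTF-8 so the accented characters survive on every platform.
Path("books.xml").write_text(BOOKS_XML, encoding="utf-8")
```

Writing the file explicitly as UTF-8 matters here because two of the titles and authors contain non-ASCII characters.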

Method 1: Using the “xml.etree.ElementTree” module

“xml.etree.ElementTree” is a built-in module. Its parse() function accepts an XML file and returns a tree. Calling the iter() method on the root element then walks every element in the tree, and each element's “.text” attribute holds its text.

import xml.etree.ElementTree as ET

tree = ET.parse('books.xml')
root = tree.getroot()

for element in root.iter():
    print(element.text)

Output

The Great Gatsby
F. Scott Fitzgerald
1925
10.99


A Brief History of Time
Stephen Hawking
1988
14.99


Cien años de soledad
Gabriel García Márquez
1967
12.99

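As expected, we get only the text content of each element. Just DATA! The blank lines come from the container elements (“bookstore” and “book”), whose text is only whitespace.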
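If you want to skip those whitespace-only entries, or pull out named fields instead of every text node, here is a minimal sketch. It uses an inline sample with the same shape as “books.xml” (an assumption made so the snippet runs on its own), and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Inline sample with the same shape as books.xml, so the sketch is self-contained.
SAMPLE = """<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
    <price>10.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">A Brief History of Time</title>
    <author>Stephen Hawking</author>
    <year>1988</year>
    <price>14.99</price>
  </book>
</bookstore>"""

root = ET.fromstring(SAMPLE)

# 1) All non-empty text, skipping elements whose text is None or only whitespace.
texts = [el.text.strip() for el in root.iter() if el.text and el.text.strip()]

# 2) Named fields per book. get() reads an attribute; findtext() reads a child's text.
records = [
    {
        "category": book.get("category"),
        "title": book.findtext("title"),
        "price": float(book.findtext("price", default="0")),
    }
    for book in root.findall("book")
]

print(texts)
print(records)
```

The second form is usually easier to work with downstream, because each value stays attached to the field it came from.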

Time and Space Complexities

Time Complexity: O(n), where n is the number of elements in the XML tree.
Space Complexity: O(n), as it loads the entire tree into memory.

Pros

It is well suited to small-to-medium-sized XML files.
Built-in Python library with no additional dependency.
It provides a simple API to write code that anyone can understand.

Cons

It loads the whole tree into memory, so it can become slow and memory-hungry if your XML file is very large.
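For very large files, the same built-in module also offers ET.iterparse(), which processes elements as they are read instead of building the full tree first. This is a swapped-in technique, not the parse() approach shown above. A minimal sketch, where the iter_titles() helper is a hypothetical name and an in-memory sample stands in for a real file:

```python
import io
import xml.etree.ElementTree as ET

# In-memory stand-in for a large file; a filename also works with iterparse().
SAMPLE = b"""<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <price>10.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">A Brief History of Time</title>
    <price>14.99</price>
  </book>
</bookstore>"""


def iter_titles(source):
    """Yield each book's title, clearing each <book> once it has been read."""
    # "end" events fire when an element's closing tag has been parsed,
    # so all children of <book> are available at that point.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "book":
            yield elem.findtext("title")
            elem.clear()  # free the children and text we have already processed


titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)
```

Calling elem.clear() after each book keeps memory use roughly flat, at the cost of no longer being able to navigate back to earlier books.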

Method 2: Using “lxml”

If you are looking for a third-party solution, I would highly recommend the “lxml” library. Its etree.parse() function accepts an XML file. Call .getroot() on the result to get the root element, then use the .iter() method and each element's .text attribute to extract the content.

You can install the “lxml” library using the command below:

pip install lxml

Here is the code:

from lxml import etree

tree = etree.parse('books.xml')
root = tree.getroot()

for element in root.iter():
    print(element.text)

Output

The Great Gatsby
F. Scott Fitzgerald
1925
10.99


A Brief History of Time
Stephen Hawking
1988
14.99


Cien años de soledad
Gabriel García Márquez
1967
12.99

Time and Space Complexities

Time Complexity: O(n), typically with better constant factors than xml.etree.ElementTree because lxml is built on the C library libxml2.
Space Complexity: O(n), as it loads the entire tree into memory.

Pros

Ideal for larger XML files and when advanced features like XPath are needed.
It can recover from some malformed XML when you pass a parser created with etree.XMLParser(recover=True).

Cons

It is a third-party dependency, and etree.parse() still loads the whole tree into memory, so very large files can be slow.
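The XPath support and the recovery mode mentioned above can be sketched as follows. The inline samples and variable names are assumptions for illustration, and the exact tree that recovery produces depends on how the input is broken, so treat the last part as a demonstration rather than a guarantee.

```python
from lxml import etree

SAMPLE = """<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <price>10.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">A Brief History of Time</title>
    <price>14.99</price>
  </book>
</bookstore>"""

root = etree.fromstring(SAMPLE)

# XPath: text of every <title> that sits inside a fiction book.
fiction_titles = root.xpath('//book[@category="fiction"]/title/text()')

# XPath with a numeric predicate: titles of books priced below 12.
cheap_titles = root.xpath('//book[price < 12]/title/text()')

# Malformed input: the <bookstore> root is never closed.
BROKEN = "<bookstore><book><title>Unclosed</title></book>"

# The default (strict) parser rejects it.
try:
    etree.fromstring(BROKEN)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# A recovering parser tries to build a tree from what it could read.
recovering_parser = etree.XMLParser(recover=True)
recovered_root = etree.fromstring(BROKEN, parser=recovering_parser)

print(fiction_titles, cheap_titles, strict_ok, recovered_root.tag)
```

Recovery is best kept for input you cannot fix at the source, since it can silently drop or rearrange content.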

Method 3: Using “xmltodict”

“xmltodict” is a third-party library (install it with pip install xmltodict) for when you want the extracted data as a Python dictionary. Its parse() function takes the XML content and returns a dictionary.

import xmltodict

# Read as UTF-8 explicitly, because the file contains accented characters
with open('books.xml', 'r', encoding='utf-8') as file:
    data = xmltodict.parse(file.read())
    print(data)

Output

{'bookstore': {'book': [{'@category': 'fiction', 'title': {'@lang': 'en', '#text': 'The Great Gatsby'}, 'author': 'F. Scott Fitzgerald', 'year': '1925', 'price': '10.99'}, {'@category': 'non-fiction', 'title': {'@lang': 'en', '#text': 'A Brief History of Time'}, 'author': 'Stephen Hawking', 'year': '1988', 'price': '14.99'}, {'@category': 'fiction', 'title': {'@lang': 'es', '#text': 'Cien años de soledad'}, 'author': 'Gabriel García Márquez', 'year': '1967', 'price': '12.99'}]}}
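Once you have that dictionary, extracting text is ordinary dictionary navigation. The sketch below works on a trimmed copy of the output above (two of the three books), so it runs without xmltodict installed. The rows variable is an illustrative name. One behaviour to keep in mind: when an XML file contains only a single “book”, xmltodict returns a dictionary for it rather than a one-item list, so the code normalizes both cases.

```python
# A trimmed copy of the dictionary printed above (two of the three books).
data = {
    "bookstore": {
        "book": [
            {
                "@category": "fiction",
                "title": {"@lang": "en", "#text": "The Great Gatsby"},
                "author": "F. Scott Fitzgerald",
                "year": "1925",
                "price": "10.99",
            },
            {
                "@category": "fiction",
                "title": {"@lang": "es", "#text": "Cien años de soledad"},
                "author": "Gabriel García Márquez",
                "year": "1967",
                "price": "12.99",
            },
        ]
    }
}

books = data["bookstore"]["book"]

# A single <book> would arrive as a dict instead of a list, so normalize.
if isinstance(books, dict):
    books = [books]

rows = []
for book in books:
    title = book["title"]
    # Elements with attributes become dicts: text under "#text", attributes under "@name".
    title_text = title["#text"] if isinstance(title, dict) else title
    rows.append((book["@category"], title_text, float(book["price"])))

print(rows)
```

All values arrive as strings, so numeric fields such as “price” and “year” need an explicit conversion.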

Time and Space Complexities

Time Complexity: O(n), where n is the number of elements in the XML file.
Space Complexity: O(n), as both the file content and the resulting dictionary are held in memory.

Pros

It is helpful when you want to extract the data as a Python dictionary.
Attributes and element text are exposed under predictable keys (such as “@category” and “#text”), as the output above shows.

Cons

It loads the entire XML into memory, making it unsuitable for large files.
It loses some XML structure information, such as the relative order of differently named sibling elements.

Method 4: Using the “re” module (regex)

Regular expressions are a common tool for finding and extracting text from a file or string. You can use the re.findall() function to pull specific data out of an XML file.

If you only need a few specific pieces of data from an XML file with a simple, predictable structure, and you do not want to build a full element tree, the regular-expression approach can be enough. For anything more complex, prefer one of the parsers above.

import re


def parse_xml_with_regex(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    # Match whole <book> elements; \b stops the pattern from also matching <bookstore>
    book_pattern = r'<book\b[^>]*>(.*?)</book>'
    books_raw = re.findall(book_pattern, content, re.DOTALL)

    # Process each book
    books = []
    for book_content in books_raw:
        book = {}
        # Parse individual fields within each book; [^>]* skips any attributes
        fields = re.findall(r'<(\w+)[^>]*>(.*?)</\1>', book_content, re.DOTALL)
        for tag, value in fields:
            book[tag] = value.strip()
        books.append(book)

    return books


# Use the function
books = parse_xml_with_regex('books.xml')

# Print all books
print(f"Total number of books: {len(books)}")
for i, book in enumerate(books, 1):
    print(f"\nBook {i}:")
    for key, value in book.items():
        print(f"  {key}: {value}")

print("\nAll books have been printed.")

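Output

Total number of books: 3

Book 1:
  title: The Great Gatsby
  author: F. Scott Fitzgerald
  year: 1925
  price: 10.99

Book 2:
  title: A Brief History of Time
  author: Stephen Hawking
  year: 1988
  price: 14.99

Book 3:
  title: Cien años de soledad
  author: Gabriel García Márquez
  year: 1967
  price: 12.99

All books have been printed.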

Time and Space Complexities

Time Complexity: O(n), where n is the XML file size. However, the constant factors might be higher than specialized XML parsers for complex patterns.
Space Complexity: O(n) because it reads the entire file into memory.

Pros

It is quick to write and run for small XML files with a simple, predictable structure because no parser has to be set up, and it lets you target specific fields.
Since it does not rely on third-party modules, you can use the “re” module in any environment that has Python installed.

Cons

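It loads the entire XML into memory, making it unsuitable for large files.
It is fragile: nested elements with the same tag, comments, CDATA sections, and entities such as “&amp;” are not handled the way a real XML parser handles them.
You need to learn regular expressions, which have a steep learning curve.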
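For the targeted extraction described in this section, a narrower pattern is often enough. Here is a minimal sketch that pulls out only the titles. The second title is a hypothetical value added to show that regex returns raw markup, so entities have to be decoded separately. html.unescape() is used for that here as an assumption of convenience; it decodes HTML entities, which include the five predefined XML ones.

```python
import html
import re

# Inline sample; the second title is hypothetical and only there to show an entity.
SAMPLE = """<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
  </book>
  <book category="non-fiction">
    <title lang="en">Science &amp; Society</title>
  </book>
</bookstore>"""

# [^>]* keeps the match inside one opening tag (skipping attributes);
# (.*?) is non-greedy, so each match stops at the first closing </title>.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.DOTALL)

raw_titles = TITLE_RE.findall(SAMPLE)

# Regex hands back the raw text, so decode entities such as &amp; by hand.
titles = [html.unescape(t.strip()) for t in raw_titles]

print(raw_titles)
print(titles)
```

A real parser does this decoding for you, which is one more reason to keep regex for simple, well-known input.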

That’s all!


Krunal Lathiya
