How to Extract Text from XML File in Python

  • 27 Sep, 2024

Many traditional companies still use XML (Extensible Markup Language) in web services. To integrate the contents of XML files into your application, or to build datasets for analysis, you first need to extract the data from those files.

Once you have extracted the text, you can feed it into downstream work such as analytics or use it as training data for machine-learning models, including LLMs (Large Language Models).

Extracting Text from XML File in Python

Here are four different ways to extract text from XML files in Python:

  1. Using “xml.etree.ElementTree” built-in module
  2. Using “lxml” (third-party library)
  3. Using “xmltodict” (for converting XML text into a dictionary)
  4. Using the “re” module (Regular Expressions)

To extract the data, we need a well-formed XML file. For this project, we will use a “books.xml” file. Its root element is “bookstore”, which contains three “book” elements. Each book has a “category” attribute and four children: “title” (with a “lang” attribute), “author”, “year”, and “price”.
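
If you want to follow along, the sketch below recreates a books.xml file. The contents are reconstructed from the parsed output shown later in this article; the indentation and the XML declaration are assumptions, so the author's original file may differ in those details.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Reconstructed sample data. Element names, attributes, and values come from
# the outputs in this article; whitespace and the XML declaration are assumed.
BOOKS_XML = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
    <price>10.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">A Brief History of Time</title>
    <author>Stephen Hawking</author>
    <year>1988</year>
    <price>14.99</price>
  </book>
  <book category="fiction">
    <title lang="es">Cien años de soledad</title>
    <author>Gabriel García Márquez</author>
    <year>1967</year>
    <price>12.99</price>
  </book>
</bookstore>
"""

# Write the file as UTF-8 so the accented characters survive on every platform.
Path("books.xml").write_text(BOOKS_XML, encoding="utf-8")

# Sanity check: the file parses and contains three <book> elements.
root = ET.parse("books.xml").getroot()
print(root.tag, len(root.findall("book")))
```

Running it once creates books.xml in the current directory, which the remaining examples can read.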

Method 1: Using the “xml.etree.ElementTree” module

“xml.etree.ElementTree” is a built-in module. Its parse() function accepts an XML file and returns a tree. You can then visit every element with the iter() method and read each element's text through its .text attribute.

import xml.etree.ElementTree as ET

tree = ET.parse('books.xml')
root = tree.getroot()

for element in root.iter():
    print(element.text)

Output

The Great Gatsby
F. Scott Fitzgerald
1925
10.99


A Brief History of Time
Stephen Hawking
1988
14.99


Cien años de soledad
Gabriel García Márquez
1967
12.99

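As expected, the output contains only the data, with no tags. The blank lines come from container elements such as “bookstore” and “book”: their .text holds only the whitespace used to indent the file.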
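
If the blank lines are unwanted, one option is to strip each value and skip the empty ones. The following is a minimal sketch rather than part of the original walkthrough: the helper name extract_text is hypothetical, and the inline sample is assumed to mirror the structure of books.xml.

```python
import xml.etree.ElementTree as ET

# Inline sample so the snippet runs on its own (assumed to mirror books.xml).
SAMPLE = """<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
  </book>
</bookstore>"""


def extract_text(root):
    """Return the stripped, non-empty text of every element, in document order."""
    values = []
    for element in root.iter():
        # element.text is None for elements with no text at all
        text = (element.text or "").strip()
        if text:
            values.append(text)
    return values


root = ET.fromstring(SAMPLE)
print(extract_text(root))  # ['The Great Gatsby', 'F. Scott Fitzgerald']
```

Note that this only looks at .text; text that follows a child element (stored in .tail) is ignored, which is fine for data-style XML like this file.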

Time and Space Complexities

  1. Time Complexity: O(n), where n is the number of elements in the XML tree.
  2. Space Complexity: O(n), as it loads the entire tree into memory.

Pros

  1. It is well suited to small-to-medium-sized XML files.
  2. Built-in Python library with no additional dependency.
  3. It provides a simple API to write code that anyone can understand.

Cons

  1. It can become slow and memory-hungry on very large XML files, because parse() builds the whole tree in memory.
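
For very large files, the same built-in module also offers ET.iterparse(), which yields elements as they are parsed instead of building the full tree first. The sketch below is one possible way to use it; the function name iter_book_titles and the tiny in-memory sample are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_book_titles(source):
    """Yield the title of each <book> as soon as that element has been parsed."""
    # "end" events fire once an element and all of its children are complete.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "book":
            yield elem.findtext("title")
            # Drop the children of the processed <book> to keep memory low.
            # The emptied element itself stays attached to the root.
            elem.clear()


# A file path such as 'books.xml' works too; BytesIO keeps the demo standalone.
sample = io.BytesIO(
    b"<bookstore>"
    b"<book><title>A</title></book>"
    b"<book><title>B</title></book>"
    b"</bookstore>"
)
print(list(iter_book_titles(sample)))  # ['A', 'B']
```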

Method 2: Using “lxml”

If you are looking for a third-party solution, I highly recommend the “lxml” library. Its etree.parse() function accepts an XML file and returns a tree. Call .getroot() to get the root element, then use the .iter() method and each element's .text attribute to extract the content.

You can install the “lxml” library using the command below:

pip install lxml

Here is the code:

from lxml import etree

tree = etree.parse('books.xml')
root = tree.getroot()

for element in root.iter():
    print(element.text)

Output

The Great Gatsby
F. Scott Fitzgerald
1925
10.99


A Brief History of Time
Stephen Hawking
1988
14.99


Cien años de soledad
Gabriel García Márquez
1967
12.99

Time and Space Complexities

  1. Time Complexity: O(n), typically with better constant factors than xml.etree.ElementTree because lxml is built on the C library libxml2.
  2. Space Complexity: O(n), as it loads the entire tree into memory.

Pros

  1. Ideal for larger XML files and when advanced features like XPath are needed.
  2. It copes better with malformed XML: a parser created with etree.XMLParser(recover=True) tries to recover from errors instead of failing immediately.

Cons

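  1. It is a third-party dependency that has to be installed separately.
  2. Like ElementTree, etree.parse() builds the whole tree in memory, so very large XML files can still be slow to process.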
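
The XPath support mentioned in the pros can be sketched as follows. This is an illustrative example rather than the only way to query with lxml; the inline sample is a trimmed-down, assumed copy of books.xml, and the price threshold is arbitrary.

```python
from lxml import etree

# Trimmed-down sample with the same element names as books.xml (assumed).
SAMPLE = b"""<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <price>10.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">A Brief History of Time</title>
    <price>14.99</price>
  </book>
</bookstore>"""

root = etree.fromstring(SAMPLE)

# Select the title text of every book whose category attribute is "fiction".
fiction_titles = root.xpath('//book[@category="fiction"]/title/text()')

# Select titles of books whose price, read as a number, exceeds 12.
pricey_titles = root.xpath('//book[number(price) > 12]/title/text()')

print(fiction_titles)
print(pricey_titles)
```

XPath lets you express the filter in the query itself, so there is no need to loop over every element and test it in Python.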

Method 3: Using “xmltodict”

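“xmltodict” is a third-party library that is useful when you want the extracted data as a Python dictionary. Its parse() function takes XML text and returns a dictionary. Attributes are stored under keys prefixed with “@”, and the text of an element that also has attributes is stored under “#text”. You can install it with pip install xmltodict.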

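import xmltodict

# Open the file as UTF-8 explicitly so accented characters are decoded
# correctly regardless of the platform's default encoding
with open('books.xml', 'r', encoding='utf-8') as file:
    data = xmltodict.parse(file.read())
    print(data)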

Output

{'bookstore': {'book': [{'@category': 'fiction', 'title': {'@lang': 'en', '#text': 'The Great Gatsby'}, 'author': 'F. Scott Fitzgerald', 'year': '1925', 'price': '10.99'}, {'@category': 'non-fiction', 'title': {'@lang': 'en', '#text': 'A Brief History of Time'}, 'author': 'Stephen Hawking', 'year': '1988', 'price': '14.99'}, {'@category': 'fiction', 'title': {'@lang': 'es', '#text': 'Cien años de soledad'}, 'author': 'Gabriel García Márquez', 'year': '1967', 'price': '12.99'}]}}

Time and Space Complexities

  1. Time Complexity: O(n), where n is the number of elements in the XML file.
  2. Space Complexity: O(n)

Pros

  1. It is helpful when you want to extract the data as a Python dictionary.
  2. The result is made of plain dictionaries, lists, and strings, so it is easy to serialize to JSON.

Cons

  1. It loads the entire XML into memory, making it unsuitable for large files.
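  2. It can obscure some XML structure. For example, a tag that appears once becomes a single value while a repeated tag becomes a list, so the shape of the result depends on the data.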
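
Once the data is a dictionary, extracting fields is ordinary key access. The sketch below shows one way to walk it, under the assumption that the input has the same shape as books.xml; the force_list argument is used so that “book” is always a list, even when the file contains a single book.

```python
import xmltodict

# Single-book sample (assumed shape) to show why force_list is useful.
SAMPLE = """<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
  </book>
</bookstore>"""

# Without force_list, a lone <book> would be a dict instead of a list.
data = xmltodict.parse(SAMPLE, force_list=("book",))

rows = []
for book in data["bookstore"]["book"]:
    title = book["title"]
    # An element with attributes is a dict holding '#text';
    # an element without attributes is a plain string.
    title_text = title["#text"] if isinstance(title, dict) else title
    rows.append((book["@category"], title_text, book["author"]))

print(rows)
```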

Method 4: Using the “re” module (regex)

Regular expressions are a common tool for finding and extracting text from a file or string. They do not understand XML structure, but for files with a simple, predictable layout you can use the re.findall() function to pull out exactly the data you are looking for.

If you only need a few specific pieces of data and do not want to build a full element tree, the “regular expression” approach can be enough.

import re


def parse_xml_with_regex(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    # Capture the body of each <book> element. The \b keeps "<book" from
    # also matching "<bookstore>", and [^>]* skips over any attributes.
    book_pattern = r'<book\b[^>]*>(.*?)</book>'
    books_raw = re.findall(book_pattern, content, re.DOTALL)

    # Process each book
    books = []
    for book_content in books_raw:
        book = {}
        # Each field looks like <tag ...>value</tag>; the backreference \1
        # requires the closing tag to match the opening tag
        fields = re.findall(r'<(\w+)\b[^>]*>(.*?)</\1>', book_content, re.DOTALL)
        for tag, value in fields:
            book[tag] = value.strip()
        books.append(book)

    return books


# Use the function
books = parse_xml_with_regex('books.xml')

# Print all books
print(f"Total number of books: {len(books)}")
for i, book in enumerate(books, 1):
    print(f"\nBook {i}:")
    for key, value in book.items():
        print(f"  {key}: {value}")

print("\nAll books have been printed.")

Output

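Total number of books: 3

Book 1:
  title: The Great Gatsby
  author: F. Scott Fitzgerald
  year: 1925
  price: 10.99

Book 2:
  title: A Brief History of Time
  author: Stephen Hawking
  year: 1988
  price: 14.99

Book 3:
  title: Cien años de soledad
  author: Gabriel García Márquez
  year: 1967
  price: 12.99

All books have been printed.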

Time and Space Complexities

  1. Time Complexity: roughly O(n) for well-formed input, where n is the XML file size. Complex patterns can backtrack heavily, so the constant factors (and in bad cases the growth rate) may be worse than with a specialized XML parser.
  2. Space Complexity: O(n) because it reads the entire file into memory.

Pros

  1. It is quick to write and run for small XML files with a simple layout, because no element tree is built. It also lets you target only the specific fields you need.
  2. Since it does not rely on third-party modules, you can use the “re” module in any environment that has Python installed.

Cons

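  1. As written, it reads the entire XML file into memory, making it unsuitable for large files.
  2. You need to learn “regular expressions”, which can have a steep learning curve.
  3. It is brittle: patterns like these do not handle nested elements with the same name, comments, CDATA sections, or entities such as &amp;, and they do not check that the XML is well-formed.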
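
To illustrate targeted extraction and the entity caveat together, here is a small sketch that pulls out only the titles. The second title is made up so that the sample contains an entity, and unescape() from the standard library's xml.sax.saxutils module is used as an assumed, minimal way to decode it.

```python
import re
from xml.sax.saxutils import unescape

# The second title is hypothetical; it exists only to include an entity.
SAMPLE = """<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <price>10.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">Q &amp; A on Time</title>
    <price>14.99</price>
  </book>
</bookstore>"""

# \b stops "<title" from matching longer tag names; [^>]* skips attributes.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.DOTALL)

# A regex returns the raw text, so &amp;, &lt; and &gt; must be decoded by hand.
titles = [unescape(raw.strip()) for raw in TITLE_RE.findall(SAMPLE)]
print(titles)  # ['The Great Gatsby', 'Q & A on Time']
```

A real parser does this decoding for you, which is one reason to prefer Methods 1 to 3 whenever the file is not trivially simple.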

That’s all!

Krunal Lathiya

With a career spanning over eight years in the field of Computer Science, Krunal’s expertise is rooted in a solid foundation of hands-on experience, complemented by a continuous pursuit of knowledge.
