Many traditional companies still use XML (Extensible Markup Language) in web services. You can integrate the content of XML files into your application or turn it into data points for analysis, and that is where extracting data from XML files comes into play.
Once you extract the text, you can use it as input for machine-learning work, such as training or fine-tuning LLMs (Large Language Models).
Here are four different ways to extract text from XML files in Python:
- Using “xml.etree.ElementTree” built-in module
- Using “lxml” (third-party library)
- Using “xmltodict” (for converting XML text into a dictionary)
- Using the “re” module (Regular Expressions)
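To extract the data, we need a proper XML file. For this project, we will use a “books.xml” file: a bookstore root element containing three book elements, each with a category attribute and title (with a lang attribute), author, year, and price children.
The file has a valid XML structure.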
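For readers who want to follow along, a file with this structure can be generated with the sketch below. The content is inferred from the outputs shown later in this article; the XML declaration, the indentation, and the helper name write_sample are assumptions.

```python
from pathlib import Path

# Reconstructed sample data: element names, attributes, and values come from
# the outputs in this article; the declaration and layout are assumed.
BOOKS_XML = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
    <price>10.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">A Brief History of Time</title>
    <author>Stephen Hawking</author>
    <year>1988</year>
    <price>14.99</price>
  </book>
  <book category="fiction">
    <title lang="es">Cien años de soledad</title>
    <author>Gabriel García Márquez</author>
    <year>1967</year>
    <price>12.99</price>
  </book>
</bookstore>
"""


def write_sample(path="books.xml"):
    """Write the sample document to disk as UTF-8 and return its path."""
    target = Path(path)
    target.write_text(BOOKS_XML, encoding="utf-8")
    return target
```

Calling write_sample() creates books.xml in the current directory, which is the file name the examples in this article expect.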
Method 1: Using the “xml.etree.ElementTree” module
“xml.etree.ElementTree” is a built-in module. Its parse() function accepts an XML file and returns a tree; you can then walk every element with the iter() method and read the text attribute of each element.
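import xml.etree.ElementTree as ET

tree = ET.parse('books.xml')
root = tree.getroot()

# iter() also visits container elements such as <bookstore> and <book>,
# whose text is only whitespace, so skip those.
for element in root.iter():
    if element.text and element.text.strip():
        print(element.text.strip())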
Output
The Great Gatsby
F. Scott Fitzgerald
1925
10.99
A Brief History of Time
Stephen Hawking
1988
14.99
Cien años de soledad
Gabriel García Márquez
1967
12.99
And as we expected, we got the exact output we wanted. Just DATA!
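Walking every element is fine for dumping text, but you often want named fields. As a minimal sketch, assuming the bookstore layout used in this article, the standard findall(), findtext(), and get() calls can build one record per book. The SAMPLE string and the extract_books helper are illustrative names, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Inline sample (assumed layout) so the sketch runs without books.xml.
SAMPLE = """<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
    <price>10.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">A Brief History of Time</title>
    <author>Stephen Hawking</author>
    <year>1988</year>
    <price>14.99</price>
  </book>
</bookstore>"""


def extract_books(xml_text):
    """Return one dict per <book>, converting year and price to numbers."""
    root = ET.fromstring(xml_text)
    books = []
    for book in root.findall("book"):
        books.append({
            "category": book.get("category"),      # attribute lookup
            "title": book.findtext("title"),       # text of a child element
            "author": book.findtext("author"),
            "year": int(book.findtext("year")),
            "price": float(book.findtext("price")),
        })
    return books


for record in extract_books(SAMPLE):
    print(record["title"], record["price"])
```

Compared with iter(), this keeps each value attached to its field name, at the cost of hard-coding the tag names.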
Time and Space Complexities
- Time Complexity: O(n), where n is the number of elements in the XML tree.
- Space Complexity: O(n), as it loads the entire tree into memory.
Pros
- It is well suited to small-to-medium-sized XML files.
- Built-in Python library with no additional dependency.
- It provides a simple API to write code that anyone can understand.
Cons
- It can become slow and memory-hungry if your XML file is very large, because parse() builds the whole tree in memory.
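For very large files, the same module also offers iterparse(), which hands you elements as they are completed instead of building the whole tree first. The sketch below is one way to use it, not the method shown above; the SAMPLE bytes and the iter_titles helper are hypothetical, and with a real file you would pass its path instead of a BytesIO object.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory sample standing in for a large file.
SAMPLE = b"""<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <price>10.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">A Brief History of Time</title>
    <price>14.99</price>
  </book>
</bookstore>"""


def iter_titles(source):
    """Yield book titles one at a time while the document is being parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "book":
            yield elem.findtext("title")
            # Drop the finished book's children so they can be garbage collected.
            elem.clear()


for title in iter_titles(io.BytesIO(SAMPLE)):
    print(title)
```

The cleared book elements themselves stay attached to the root, so memory use still grows slowly with the number of books, but far less than with a full tree.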
Method 2: Using “lxml”
If you are looking for a third-party solution, I would highly recommend the “lxml” library. Its etree.parse() function accepts an XML file; call .getroot() to get the root element, then use the .iter() method and the .text attribute to extract the content.
You can install the “lxml” library using the command below:
pip install lxml
Here is the code:
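from lxml import etree

tree = etree.parse('books.xml')
root = tree.getroot()

# Skip container elements whose text is only whitespace
for element in root.iter():
    if element.text and element.text.strip():
        print(element.text.strip())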
Output
The Great Gatsby
F. Scott Fitzgerald
1925
10.99
A Brief History of Time
Stephen Hawking
1988
14.99
Cien años de soledad
Gabriel García Márquez
1967
12.99
Time and Space Complexities
- Time Complexity: O(n), typically with better constant factors than xml.etree.ElementTree.
- Space Complexity: O(n), as it loads the entire tree into memory.
Pros
- Ideal for larger XML files and when advanced features like XPath are needed.
- It copes better with malformed XML: a parser created with recover=True tries to build a tree instead of raising an error.
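Both points can be illustrated with a short sketch, assuming the bookstore layout from this article. It uses the xpath() method for filtered queries and an XMLParser created with recover=True for a deliberately broken snippet. The SAMPLE and BROKEN strings are made up for the example, and what a recovering parser salvages depends on the input, so treat the second half as a demonstration rather than a guarantee.

```python
from lxml import etree

# Inline sample (assumed layout) so the sketch runs without books.xml.
SAMPLE = """<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <price>10.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">A Brief History of Time</title>
    <price>14.99</price>
  </book>
  <book category="fiction">
    <title lang="es">Cien años de soledad</title>
    <price>12.99</price>
  </book>
</bookstore>"""

root = etree.fromstring(SAMPLE)

# XPath predicates filter on attributes and on child values.
fiction_titles = root.xpath('//book[@category="fiction"]/title/text()')
cheap_titles = root.xpath('//book[price < 12]/title/text()')
print(fiction_titles)
print(cheap_titles)

# Hypothetical malformed input: <title> is never closed.
BROKEN = "<bookstore><book><title>Unclosed title</book></bookstore>"

try:
    etree.fromstring(BROKEN)          # the default parser is strict
    strict_failed = False
except etree.XMLSyntaxError:
    strict_failed = True

lenient_parser = etree.XMLParser(recover=True)
recovered = etree.fromstring(BROKEN, lenient_parser)
print(strict_failed, recovered.tag)
```

Recovery is best used as a fallback: it returns a tree, but parts of a badly broken document may be dropped or reattached in ways you did not intend.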
Cons
- It is an extra dependency to install, and parse() still loads the whole tree into memory, so very large files can be slow.
Method 3: Using “xmltodict”
“xmltodict” is a third-party library for when you want the extracted data as a Python dictionary. Its parse() function reads the XML and returns a dictionary. You can install it with pip install xmltodict.
import xmltodict

# Read as UTF-8 explicitly, since the file contains non-ASCII characters
with open('books.xml', 'r', encoding='utf-8') as file:
    data = xmltodict.parse(file.read())

print(data)
Output
{'bookstore': {'book': [{'@category': 'fiction', 'title': {'@lang': 'en', '#text': 'The Great Gatsby'}, 'author': 'F. Scott Fitzgerald', 'year': '1925', 'price': '10.99'}, {'@category': 'non-fiction', 'title': {'@lang': 'en', '#text': 'A Brief History of Time'}, 'author': 'Stephen Hawking', 'year': '1988', 'price': '14.99'}, {'@category': 'fiction', 'title': {'@lang': 'es', '#text': 'Cien años de soledad'}, 'author': 'Gabriel García Márquez', 'year': '1967', 'price': '12.99'}]}}
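The result is made of ordinary nested dictionaries and lists, so individual values are reached by key: attributes under '@name' and element text under '#text', as the output shows. The sketch below assumes the same layout and uses an inline SAMPLE string (an illustrative name). It also passes force_list so that a file with a single book still yields a list rather than a bare dictionary.

```python
import xmltodict

# Inline sample (assumed layout) so the sketch runs without books.xml.
SAMPLE = """<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
    <price>10.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">A Brief History of Time</title>
    <price>14.99</price>
  </book>
</bookstore>"""

data = xmltodict.parse(SAMPLE)
books = data["bookstore"]["book"]          # a list, because <book> repeats

titles = [book["title"]["#text"] for book in books]
categories = [book["@category"] for book in books]
print(titles)
print(categories)

# With only one <book>, the value would be a dict unless force_list is set.
single = xmltodict.parse(
    "<bookstore><book><title>Solo</title></book></bookstore>",
    force_list=("book",),
)
print(type(single["bookstore"]["book"]).__name__)
```

Note that all values, including year and price, come back as strings and need converting if you want numbers.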
Time and Space Complexities
- Time Complexity: O(n), where n is the number of elements in the XML file.
- Space Complexity: O(n)
Pros
- It is helpful when you want to extract the data as a Python dictionary.
- Attributes and element text are exposed under predictable keys ('@name' and '#text'), as the output above shows.
Cons
- It loads the entire XML into memory, making it unsuitable for large files.
- It loses some XML structure information, such as the relative order of differently named sibling elements.
Method 4: Using the “re” module (regex)
Regular expressions are a common tool for finding and extracting text from a file, string, or any other text source. You can use the re.findall() function to pull the exact data you are looking for out of an XML file.
If you are looking to extract specific pieces of data from a large XML file without parsing the entire structure, I would recommend you use the “regular expression” approach.
import re

def parse_xml_with_regex(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    # Capture the body of each <book> element; \b stops <bookstore> from matching
    book_pattern = r'<book\b[^>]*>(.*?)</book>'
    books_raw = re.findall(book_pattern, content, re.DOTALL)

    # Process each book
    books = []
    for book_content in books_raw:
        book = {}
        # Parse individual fields within each book
        fields = re.findall(r'<(\w+)\b[^>]*>(.*?)</\1>', book_content, re.DOTALL)
        for tag, value in fields:
            book[tag] = value.strip()
        books.append(book)
    return books

# Use the function
books = parse_xml_with_regex('books.xml')

# Print all books
print(f"Total number of books: {len(books)}")
for i, book in enumerate(books, 1):
    print(f"\nBook {i}:")
    for key, value in book.items():
        print(f" {key}: {value}")

print("\nAll books have been printed.")
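Output
Total number of books: 3

Book 1:
 title: The Great Gatsby
 author: F. Scott Fitzgerald
 year: 1925
 price: 10.99

Book 2:
 title: A Brief History of Time
 author: Stephen Hawking
 year: 1988
 price: 14.99

Book 3:
 title: Cien años de soledad
 author: Gabriel García Márquez
 year: 1967
 price: 12.99

All books have been printed.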
Time and Space Complexities
- Time Complexity: O(n), where n is the XML file size. However, the constant factors might be higher than specialized XML parsers for complex patterns.
- Space Complexity: O(n) because it reads the entire file into memory.
Pros
- It can be very fast for small XML files with simple, regular markup because no parse tree is built. It also lets you target specific pieces of data.
- Since it does not rely on third-party modules, you can use the “re” module in any environment that has Python installed.
Cons
- It loads the entire XML into memory, making it unsuitable for large files.
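- You need to learn regular expressions, which have a steep learning curve.
- Regular expressions do not understand XML: entity references such as &amp; are not decoded, and nested or irregular markup can break the patterns.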
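When only one field is needed, the pattern can target it directly. The sketch below pulls out just the titles; the SAMPLE string, its second title, and the extract_titles helper are invented for illustration. Because a regex returns raw text, entity references stay encoded, so the sketch decodes them with html.unescape(), which is an assumption of convenience: it handles the predefined XML entities but is not a full XML entity resolver.

```python
import html
import re

# Hypothetical sample; the second title is invented to show an entity reference.
SAMPLE = """<bookstore>
  <book category="fiction">
    <title lang="en">The Great Gatsby</title>
  </book>
  <book category="fiction">
    <title lang="en">Pride &amp; Prejudice</title>
  </book>
</bookstore>"""

# [^>]* skips any attributes; (.*?) lazily captures the element text.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.DOTALL)


def extract_titles(xml_text):
    """Return the text of every <title>, with entity references decoded."""
    return [html.unescape(raw.strip()) for raw in TITLE_RE.findall(xml_text)]


print(extract_titles(SAMPLE))
```

This works because title elements here are flat and never nested; for anything less regular, one of the parser-based methods is the safer choice.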
That’s all!