If you are categorizing websites by type (.edu, .tech, .ai) or region (.in, .uk, .co.uk), you need to extract the top-level domain (TLD) from the URL. TLD extraction is also useful in web filtering systems and security tools.
The figure above shows which part of a URL is the TLD.
If you have used Google Analytics in the past, you will recognize the use case: TLD extraction lets you group and analyze traffic sources by domain type or country of origin. There are many such use cases, but the question is how to do it.
Here are three ways to extract the top-level domain (TLD) from a URL in Python:
- Using the tldextract library
- Using urllib.parse module
- Using regular expression
Method 1: Using the tldextract library
The tldextract library provides a function called “tldextract.extract()” that parses a URL and splits it into its subdomain, domain, and suffix.
To use this library, you need to install it first using the command below:
pip install tldextract
Now, import the library and call the function like this:
import tldextract

def extract_tld(url):
    extracted = tldextract.extract(url)
    return extracted.suffix

# Usage
url = "https://appdividend.com/category/python"
tld = extract_tld(url)
print(f"The TLD is: {tld}")
Output
The TLD is: com
I highly recommend this approach because it gives accurate output and works well with multi-level suffixes (.co.uk, .co.in) and other complex domains.
import tldextract

def extract_tld(url):
    extracted = tldextract.extract(url)
    return extracted.suffix

# Usage
url = "https://sprintchase.co.uk/"
tld = extract_tld(url)
print(f"The TLD is: {tld}")
Output
The TLD is: co.uk
This library uses Mozilla’s “Public Suffix List”, which is updated regularly with the latest TLD information, so it handles edge cases very well. The trade-off is an external dependency: you have to install the library before you can use it.
Space Complexity: O(1) – The space used is constant regardless of input size.
Time Complexity: O(n) – where n is the length of the URL string.
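To make the idea of suffix-list matching more concrete, here is a minimal sketch that uses only the standard library. It is not how tldextract is implemented; SAMPLE_SUFFIXES and longest_known_suffix() are hypothetical names, and the tiny hand-picked set stands in for the real Public Suffix List, which has thousands of entries plus wildcard and exception rules that are ignored here.

```python
# Tiny, hand-picked sample of suffixes (an assumption for illustration).
# The real Public Suffix List is far larger and has extra rule types.
SAMPLE_SUFFIXES = {"com", "edu", "in", "co.in", "uk", "co.uk"}


def longest_known_suffix(hostname):
    """Return the longest suffix of hostname found in SAMPLE_SUFFIXES."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the longest candidate first: "a.co.uk", then "co.uk", then "uk".
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in SAMPLE_SUFFIXES:
            return candidate
    return ""  # no known suffix


print(longest_known_suffix("sprintchase.co.uk"))  # co.uk
print(longest_known_suffix("appdividend.com"))    # com
```

Because the longest candidate is checked first, “co.uk” wins over “uk”, which is the behavior that plain string splitting cannot give you.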
Method 2: Using urllib.parse module
The urllib.parse module provides the “urlparse()” function, which you can use to parse the URL and extract the domain. In the next step, we split the domain on dots and take the last part as the TLD. It is a built-in module, so it does not require any additional installation.
from urllib.parse import urlparse

def extract_tld(url):
    parsed_url = urlparse(url)
    domain = parsed_url.netloc
    tld = domain.split('.')[-1]
    return tld

# Usage
url = "https://sprintchase.com"
tld = extract_tld(url)
print(f"The TLD is: {tld}")
Output
The TLD is: com
The big disadvantage of this approach is that it does not work well with multi-level suffixes. For example, if you pass “https://sprintchase.co.uk”, it returns “uk” and not “co.uk”.
from urllib.parse import urlparse

def extract_tld(url):
    parsed_url = urlparse(url)
    domain = parsed_url.netloc
    tld = domain.split('.')[-1]
    return tld

# Usage
url = "https://sprintchase.co.uk"
tld = extract_tld(url)
print(f"The TLD is: {tld}")
Output
The TLD is: uk
This method is simple and fast, but it is not accurate for multi-level suffixes.
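One more thing to watch for with this method: “netloc” also contains any port or user info in the URL, so splitting it directly can give results like “com:8080”. A possible variation, sketched below, reads the “hostname” attribute instead. The function name extract_last_label() is a hypothetical one chosen for this example, and the multi-level suffix limitation still applies.

```python
from urllib.parse import urlparse


def extract_last_label(url):
    """Return the last dot-separated label of the URL's host, or None."""
    # urlparse only fills in the host when the URL has a scheme
    # (or starts with "//"); otherwise hostname is None.
    hostname = urlparse(url).hostname  # lowercased, no user info, no port
    if not hostname or "." not in hostname:
        return None
    return hostname.rstrip(".").split(".")[-1]


print(extract_last_label("https://user@sprintchase.com:8080/path"))  # com
print(extract_last_label("sprintchase.com"))                          # None
```

Note that a URL without a scheme returns None here, and an IP address host would still return its last number, so you may want extra checks depending on your input.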
Method 3: Using regular expression
When you want to find a specific substring in a string or URL, regular expressions are a flexible option. With the right pattern, they can pull out almost any piece of text you need.
Python’s “re” module provides the “re.search()” function, which tries to match the pattern against the URL. If it finds a match, we take the second captured group, which is the substring after the last “.” (dot) in the host, and return it.
import re

def extract_tld_re(url):
    pattern = r'(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)\.([^:\/\n]+)'
    match = re.search(pattern, url)
    if match:
        return match.group(2)
    return None

# Usage
url = "https://sprintchase.com"
tld = extract_tld_re(url)
print(f"The TLD is: {tld}")
Output
The TLD is: com
This approach is customizable, and you can create any type of pattern you want. However, it requires a good understanding of how regular expressions work, and the pattern above does not work with multi-level suffixes like .co.uk.
import re

def extract_tld_re(url):
    pattern = r'(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)\.([^:\/\n]+)'
    match = re.search(pattern, url)
    if match:
        return match.group(2)
    return None

# Usage
url = "https://sprintchase.co.uk"
tld = extract_tld_re(url)
print(f"The TLD is: {tld}")
Output
The TLD is: uk
The output should be “co.uk”, but it returns “uk”, which is incorrect!
You can often improve the performance of regular expressions with techniques like lazy quantifiers and atomic grouping. Keep in mind that Python’s “re” module supports atomic grouping only from Python 3.11 onward.
Space Complexity: O(1) – The space used is constant regardless of input size.
Time Complexity: O(n) – Where n is the length of the input URL string.
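Since the regex approach is customizable, one possible tweak is to list the multi-level suffixes you care about and try them before falling back to the last label. The sketch below does that with a precompiled pattern and a lazy quantifier for the host part. MULTI_LEVEL_SUFFIXES and extract_suffix_re() are hypothetical names, and the three listed suffixes are only an assumption for illustration; a hand-maintained list like this will never be as complete as the Public Suffix List.

```python
import re

# Hypothetical, hand-maintained list of multi-level suffixes to recognize.
MULTI_LEVEL_SUFFIXES = ["co.uk", "co.in", "com.au"]
_ALTERNATIVES = "|".join(re.escape(s) for s in MULTI_LEVEL_SUFFIXES)

# Compiled once at import time and reused for every call.
SUFFIX_PATTERN = re.compile(
    r"^(?:https?://)?"        # optional scheme
    r"(?:[^@/\n]+@)?"         # optional user info
    r"[^:/\n]+?\."            # host labels before the suffix (lazy)
    r"(" + _ALTERNATIVES + r"|[^.:/\n]+)"  # known suffix, else last label
    r"(?=[:/]|$)"             # suffix must end the host
)


def extract_suffix_re(url):
    """Return the matched suffix, or None if the URL has no dotted host."""
    match = SUFFIX_PATTERN.match(url)
    return match.group(1) if match else None


print(extract_suffix_re("https://sprintchase.co.uk"))  # co.uk
print(extract_suffix_re("https://sprintchase.com"))    # com
```

Any suffix that is not in the list still falls back to the last label, so “https://example.org.uk” would give “uk” with this sketch.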
That’s all!