Here are three ways to extract email addresses from a text in Python:
- Using simple regex
- Using complex regex
- Using the email-validator library
The basic idea is simple: search a string or file for email-like patterns, collect the matches in a list, and print the list or save it to another file.
Method 1: Using simple regex
The re (regular expression) module provides a “re.findall()” function that accepts a pattern and a text, finds every non-overlapping substring of the text that matches the pattern, and returns the matches as a list.
Of course, we define a pattern for email addresses because that is what we need to extract.
import re

def simple_regex_email_extraction(text):
    # Pattern that matches common email address formats
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    # Return every substring of the text that matches the pattern
    return re.findall(pattern, text)

text = "Contact us at support@appdividend.com or sales@appdividend.co.uk for more information."
print("List of Email Addresses:", simple_regex_email_extraction(text))
Output
List of Email Addresses: ['support@appdividend.com', 'sales@appdividend.co.uk']
The most notable strength of this method is that it is fast and easy to implement. However, it can miss complex email formats or incorrectly match invalid emails.
If you are looking for a quick approach where accuracy isn’t a concern, I recommend using this method.
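To make the trade-off concrete, here is a small sketch that runs the simple pattern against a few malformed addresses. The sample strings and the names SIMPLE_PATTERN and results are chosen for illustration only.

```python
import re

# The simple pattern from above, compiled once for reuse.
SIMPLE_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

samples = [
    "invalid@example..com",   # consecutive dots in the domain
    "in^valid@example.com",   # '^' is not in the local-part character class
    "invalid@examplecom",     # no dot in the domain
]

# Map each sample to whatever the pattern extracts from it.
results = {sample: SIMPLE_PATTERN.findall(sample) for sample in samples}

for sample, found in results.items():
    print(f"{sample!r} -> {found}")
```

The first sample is accepted even though it contains consecutive dots, the second is cut down to valid@example.com because the match starts after the unsupported character, and the third is skipped entirely.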
Extracting emails from an external file
If you are working with an external data source file and you want to extract emails from that file and create a new file, you can also use this approach.
Assume that we have an external text file like this: data_source.txt
Company Contacts and Notes

Marketing Department
Sarah Johnson: sarah.johnson@example.com
Mike Brown: mike.brown@example.com
Contact for campaign ideas: ideas@marketing.example.com

IT Support
Help Desk: helpdesk@example.com
John Smith (Senior Technician): john.smith@it.example.com
For urgent matters: urgent_support@example.com

Human Resources
General Inquiries: hr@example.com
Jane Doe (HR Manager): jane.doe@hr.example.com

Customer Service
General Support: support@example.com
Returns Department: returns@example.com
Feedback: feedback@customer.example.com

Sales Team
New Business: newbusiness@sales.example.com
Account Management: accounts@sales.example.com

Invalid Email Examples (for testing)
No Domain: invalid@
No Username: @invalid.com
Missing Dot: invalid@examplecom
Double Dot: invalid@example..com
Special Characters: in^valid@example.com

Personal Emails (mixed format for testing)
alice.wonder123@gmail.com
bob_smith84@hotmail.com
charlie.brown+work@yahoo.co.uk

Notes
Remember to update the quarterly newsletter subscription: newsletter@example.com, subscribers@example.com

External Partners
Supplier A: contact@supplier-a.com
Consulting Firm: info@consultants.co
Freelance Designer: design@freelance.io

This file contains a mix of valid and deliberately invalid email addresses.
You can now use this file to test the email extraction function with the code below:
import re

def simple_regex_email_extraction(text):
    # Pattern that matches common email address formats
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    # Return every substring of the text that matches the pattern
    return re.findall(pattern, text)

def extract_emails_from_file(file_path):
    try:
        # Reading an external data_source.txt file
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
        emails = simple_regex_email_extraction(content)
        return emails
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return []
    except IOError:
        print(f"Error: Unable to read file '{file_path}'.")
        return []

# Example usage
file_path = 'data_source.txt'
extracted_emails = extract_emails_from_file(file_path)
print(f"Extracted {len(extracted_emails)} valid email(s):")
for email in extracted_emails:
    print(email)
Output
Extracted 23 valid email(s):
sarah.johnson@example.com
mike.brown@example.com
ideas@marketing.example.com
helpdesk@example.com
john.smith@it.example.com
urgent_support@example.com
hr@example.com
jane.doe@hr.example.com
support@example.com
returns@example.com
feedback@customer.example.com
newbusiness@sales.example.com
accounts@sales.example.com
invalid@example..com
valid@example.com
alice.wonder123@gmail.com
bob_smith84@hotmail.com
charlie.brown+work@yahoo.co.uk
newsletter@example.com
subscribers@example.com
contact@supplier-a.com
info@consultants.co
design@freelance.io
In this code, we read an external file, find all patterns that match “email addresses”, create a list of them, and then print it to the console.
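The code above only prints the results. If you also want to save them to a new file, a minimal sketch could look like the following. The function name save_emails and both file paths are hypothetical, and the pattern is the simple one from this section.

```python
import re
from pathlib import Path

# Simple email pattern from this section, compiled once.
EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

def save_emails(source_path, output_path):
    """Extract emails from source_path and write them to output_path, one per line."""
    text = Path(source_path).read_text(encoding='utf-8')
    emails = EMAIL_PATTERN.findall(text)
    # Each address goes on its own line; an empty result produces an empty file.
    Path(output_path).write_text(''.join(f"{email}\n" for email in emails), encoding='utf-8')
    return emails

# Example usage (assumes data_source.txt exists in the working directory):
# save_emails('data_source.txt', 'extracted_emails.txt')
```

Returning the list as well as writing it lets the caller print a count or inspect the addresses without reading the output file back.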
Method 2: Using complex regex
As the name suggests, this method uses a much more comprehensive regular expression (regex) pattern that covers a broader range of email formats.
This approach is helpful when you need to extract emails written in a variety of formats and want higher accuracy.
import re

def complex_regex_extraction(text):
    # Keep the pattern on a single line: a line break inside a character class
    # would change its meaning, even with re.VERBOSE
    pattern = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
    return re.findall(pattern, text, re.VERBOSE | re.IGNORECASE)

text = "Contact us at krunal@appdividend.com or sales@appdividend.com.au for more information."
print("Complex Regex:", complex_regex_extraction(text))
Output
Complex Regex: ['krunal@appdividend.com', 'sales@appdividend.com.au']
If you are working with a simple use case, this approach might be overkill and requires extensive knowledge of regular expression patterns. However, it catches most valid email formats.
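One detail worth knowing before you adapt a pattern like this: every group in it is written as (?:...), a non-capturing group. That matters because re.findall() returns only the captured text when a pattern contains a capturing group. The short sketch below illustrates the difference with two deliberately simplified patterns (they are not the complex pattern above, and the variable names are made up for this example).

```python
import re

text = "Write to sales@example.com or hr@example.org today"

# With a capturing group (...), findall returns only the group's contents,
# here the last "label." piece of each domain.
capturing = re.findall(r'[\w.+-]+@([\w-]+\.)+[A-Za-z]{2,}', text)

# With a non-capturing group (?:...), findall returns the whole match.
non_capturing = re.findall(r'[\w.+-]+@(?:[\w-]+\.)+[A-Za-z]{2,}', text)

print(capturing)
print(non_capturing)
```

If you add your own groups to an email pattern and suddenly get fragments instead of full addresses, switching those groups to (?:...) is usually the fix.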
Extracting emails from an external file
Let’s extract emails using “complex regex patterns” from a file using the code below:
import re

def complex_regex_email_extraction(text):
    pattern = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
    return re.findall(pattern, text, re.VERBOSE | re.IGNORECASE)

def extract_emails_from_file(file_path):
    try:
        # Reading an external data_source.txt file
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
        emails = complex_regex_email_extraction(content)
        return emails
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return []
    except IOError:
        print(f"Error: Unable to read file '{file_path}'.")
        return []

# Example usage
file_path = 'data_source.txt'
extracted_emails = extract_emails_from_file(file_path)
print(f"Extracted {len(extracted_emails)} valid email(s):")
for email in extracted_emails:
    print(email)
Output
Extracted 22 valid email(s):
sarah.johnson@example.com
mike.brown@example.com
ideas@marketing.example.com
helpdesk@example.com
john.smith@it.example.com
urgent_support@example.com
hr@example.com
jane.doe@hr.example.com
support@example.com
returns@example.com
feedback@customer.example.com
newbusiness@sales.example.com
accounts@sales.example.com
in^valid@example.com
alice.wonder123@gmail.com
bob_smith84@hotmail.com
charlie.brown+work@yahoo.co.uk
newsletter@example.com
subscribers@example.com
contact@supplier-a.com
info@consultants.co
design@freelance.io
In this section, we reused the external source file from the previous section and replaced the simple regular expression pattern with the more complex one. The result differs from Method 1 in two ways: invalid@example..com is no longer matched because the pattern rejects consecutive dots in the domain, and in^valid@example.com is now matched in full because the pattern allows ^ in the local part. That is why the count drops from 23 to 22.
Method 3: Using the “email-validator” library
The “email-validator” library provides a “validate_email()” function that validates a single email address. We use it to check each address that the regex extracted from a string or file.
Combining regular expression extraction with a validation step filters out addresses that look plausible but are not valid, which makes the final list considerably more reliable.
First, you need to install the “email-validator” library using the command below:
pip install email-validator
Here is the complete code:
import re
from email_validator import validate_email, EmailNotValidError

def complex_regex_email_extraction(text):
    pattern = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
    return re.findall(pattern, text, re.VERBOSE | re.IGNORECASE)

def validate_emails(emails):
    valid_emails = []
    for email in emails:
        try:
            # Validate each email extracted from a file
            validate_email(email)
            # Keep only the emails that pass validation
            valid_emails.append(email)
        except EmailNotValidError:
            # Skip emails that fail validation
            pass
    return valid_emails

def extract_emails_from_file(file_path, validate=False):
    try:
        # Reading an external data_source.txt file
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
        emails = complex_regex_email_extraction(content)
        if validate:
            return validate_emails(emails)
        # Without validation, return the raw regex matches
        return emails
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return []
    except IOError:
        print(f"Error: Unable to read file '{file_path}'.")
        return []

# Example usage
file_path = 'data_source.txt'
extracted_emails = extract_emails_from_file(file_path, validate=True)
print(f"Extracted {len(extracted_emails)} valid email(s):")
for email in extracted_emails:
    print(email)
Output
Extracted 5 valid email(s):
alice.wonder123@gmail.com
bob_smith84@hotmail.com
charlie.brown+work@yahoo.co.uk
info@consultants.co
design@freelance.io
Wow! Only five emails are valid. The previous section returned 22, so why are there just five now?
By default, validate_email() does more than check syntax: it also performs a DNS lookup to confirm that the domain can receive mail (the check_deliverability option, which defaults to True). An address that fails this check raises EmailUndeliverableError, a subclass of EmailNotValidError, so our except clause silently drops it.
The example.com domain is reserved for documentation, and subdomains such as it.example.com are made up, so every address under them fails the deliverability check. The same presumably applies to supplier-a.com, which had no usable mail records when this code was run.
That’s why only the addresses at gmail.com, hotmail.com, yahoo.co.uk, consultants.co, and freelance.io are returned. Because the check depends on live DNS data, your results may differ.
If you are looking for rigorous validation, I highly recommend this library approach. It is slower than the regex-only methods because each address goes through a validation step that, by default, includes a DNS lookup, but it produces the most trustworthy list.
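If you only want a syntax check and no network access, validate_email() accepts a check_deliverability argument, and passing False skips the DNS lookup. The sketch below shows one way to wire that in. The helper names keep_valid and syntax_only_check are hypothetical, and the library is imported inside the function purely so the generic filter can be used on its own.

```python
def keep_valid(emails, is_valid):
    """Return the emails for which is_valid(email) is true, preserving order."""
    return [email for email in emails if is_valid(email)]

def syntax_only_check(email):
    """Check an address with email-validator, skipping the DNS deliverability lookup."""
    # Imported here so keep_valid() works even where the package is not installed.
    from email_validator import validate_email, EmailNotValidError
    try:
        validate_email(email, check_deliverability=False)
        return True
    except EmailNotValidError:
        return False

# Example usage (requires: pip install email-validator):
# syntax_valid = keep_valid(extracted_emails, syntax_only_check)
```

With the deliverability check turned off, the result depends only on the address syntax, so it is repeatable and much faster, at the cost of no longer knowing whether the domain can actually receive mail.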


