How to Extract Email Address from Text in Python

Whether you are building a contact list or a newsletter distribution list, you often need to extract email addresses from a string or from an external data source such as a file.

The basic idea is simple: search a string or file for email-like patterns, collect the matches in a list, and then print the list or save it to another file.

Here are three ways to extract email addresses from text in Python:

Using simple regex
Using complex regex
Using email-validator library

Method 1: Using simple regex

Python’s built-in re (regular expression) module provides the “re.findall()” function, which takes a pattern and a string and returns a list of all non-overlapping matches of the pattern in that string. All we have to supply is a pattern that describes an email address, because that is what we need to extract.

import re


def simple_regex_email_extraction(text):
    # Pattern that matches the most common email address formats
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    # Return every non-overlapping match as a list of strings
    return re.findall(pattern, text)


text = "Contact us at support@appdividend.com or sales@appdividend.co.uk for more information."
print("List of Email Addresses:", simple_regex_email_extraction(text))

Output

List of Email Addresses: ['support@appdividend.com', 'sales@appdividend.co.uk']

The main strength of this method is that it is fast and easy to implement. However, it can miss unusual but valid email formats, and it can match strings that are not valid email addresses.

If you need a quick approach and some inaccuracy is acceptable, this method is a good choice.
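
To make that limitation concrete, here is a minimal sketch that runs essentially the same simple pattern on three malformed strings (they also appear in the test file used later in this article). The name SIMPLE_PATTERN is only an illustrative choice.

```python
import re

# Same idea as the simple pattern above: local part, "@", domain, dot, TLD
SIMPLE_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

samples = [
    "invalid@example..com",   # consecutive dots in the domain
    "in^valid@example.com",   # "^" is not in the pattern's character class
    "invalid@examplecom",     # no dot in the domain
]

for sample in samples:
    print(sample, "->", SIMPLE_PATTERN.findall(sample))

# invalid@example..com -> ['invalid@example..com']
# in^valid@example.com -> ['valid@example.com']
# invalid@examplecom -> []
```

The first string is accepted even though it contains a double dot, and the second yields only a fragment of the original text, so the results of this method should be treated as candidates rather than verified addresses.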

Extracting emails from an external file

If your data lives in an external file, you can use the same approach to read that file and extract the email addresses from it.

Assume that we have an external text file named data_source.txt with the following content:

Company Contacts and Notes

Marketing Department

Sarah Johnson: sarah.johnson@example.com
Mike Brown: mike.brown@example.com
Contact for campaign ideas: ideas@marketing.example.com


IT Support

Help Desk: helpdesk@example.com
John Smith (Senior Technician): john.smith@it.example.com
For urgent matters: urgent_support@example.com


Human Resources

General Inquiries: hr@example.com
Jane Doe (HR Manager): jane.doe@hr.example.com


Customer Service

General Support: support@example.com
Returns Department: returns@example.com
Feedback: feedback@customer.example.com


Sales Team

New Business: newbusiness@sales.example.com
Account Management: accounts@sales.example.com


Invalid Email Examples (for testing)

No Domain: invalid@
No Username: @invalid.com
Missing Dot: invalid@examplecom
Double Dot: invalid@example..com
Special Characters: in^valid@example.com


Personal Emails (mixed format for testing)

alice.wonder123@gmail.com
bob_smith84@hotmail.com
charlie.brown+work@yahoo.co.uk


Notes
Remember to update the quarterly newsletter subscription:
newsletter@example.com, subscribers@example.com
External Partners

Supplier A: contact@supplier-a.com
Consulting Firm: info@consultants.co
Freelance Designer: design@freelance.io

This file contains a mix of valid and invalid email addresses, so you can use it to test the email extraction function with the following code:

import re


def simple_regex_email_extraction(text):
    # Pattern that matches the most common email address formats
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    # Return every non-overlapping match as a list of strings
    return re.findall(pattern, text)


def extract_emails_from_file(file_path):
    try:
        # Reading an external data_source.txt file
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
            emails = simple_regex_email_extraction(content)
        return emails
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return []
    except IOError:
        print(f"Error: Unable to read file '{file_path}'.")
        return []


# Example usage
file_path = 'data_source.txt'
extracted_emails = extract_emails_from_file(file_path)
print(f"Extracted {len(extracted_emails)} valid email(s):")
for email in extracted_emails:
    print(email)

Output

Extracted 23 valid email(s):

sarah.johnson@example.com
mike.brown@example.com
ideas@marketing.example.com
helpdesk@example.com
john.smith@it.example.com
urgent_support@example.com
hr@example.com
jane.doe@hr.example.com
support@example.com
returns@example.com
feedback@customer.example.com
newbusiness@sales.example.com
accounts@sales.example.com
invalid@example..com
valid@example.com
alice.wonder123@gmail.com
bob_smith84@hotmail.com
charlie.brown+work@yahoo.co.uk
newsletter@example.com
subscribers@example.com
contact@supplier-a.com
info@consultants.co
design@freelance.io

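In this code, we read the external file, find every substring that matches the email pattern, collect the matches in a list, and print them to the console. Note that the output includes invalid@example..com and valid@example.com (a fragment of in^valid@example.com), which shows how the simple pattern can match invalid input.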
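
The introduction mentions saving the extracted addresses to another file, which the code above does not do. A minimal sketch of that step could look like this. The helper name save_emails_to_file and the output file name extracted_emails.txt are assumptions for illustration, and duplicates are dropped while the first-seen order is kept.

```python
def save_emails_to_file(emails, output_path):
    """Write one address per line, skipping duplicates but keeping order."""
    # dict.fromkeys() keeps the first occurrence of each key in insertion order
    unique_emails = list(dict.fromkeys(emails))
    with open(output_path, 'w', encoding='utf-8') as file:
        for email in unique_emails:
            file.write(email + '\n')
    return len(unique_emails)


# Example usage with a small hard-coded list (hypothetical data)
emails = ['support@example.com', 'hr@example.com', 'support@example.com']
written = save_emails_to_file(emails, 'extracted_emails.txt')
print(f"Saved {written} unique email(s)")  # Saved 2 unique email(s)
```

In a real script, you would pass the list returned by extract_emails_from_file() instead of the hard-coded one.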

Method 2: Using complex regex

As the name suggests, this method uses a more comprehensive regex pattern that covers a broader range of email formats. It is helpful when you need to extract emails that appear in various formats and you want higher accuracy.

import re


def complex_regex_extraction(text):
    # Keep the pattern on a single line: with re.VERBOSE, whitespace inside a character class is still significant
    pattern = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
    return re.findall(pattern, text, re.VERBOSE | re.IGNORECASE)


text = "Contact us at krunal@appdividend.com or sales@appdividend.com.au for more information."
print("Complex Regex:", complex_regex_extraction(text))

Output

Complex Regex: ['krunal@appdividend.com', 'sales@appdividend.com.au']

If you are working with a simple use case, this approach may be overkill, and maintaining it requires solid knowledge of regular expressions. However, it catches most valid email formats.
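
One way to make a long pattern easier to read and maintain is the re.VERBOSE flag, which lets a pattern span several lines and carry comments. The sketch below applies it to the simple pattern from Method 1, not to the full pattern above; the name READABLE_PATTERN is an assumption for illustration.

```python
import re

# With re.VERBOSE, whitespace and "#" comments outside character classes are ignored
READABLE_PATTERN = re.compile(r'''
    \b
    [A-Za-z0-9._%+-]+    # local part (before the @)
    @
    [A-Za-z0-9.-]+       # domain labels
    \.
    [A-Za-z]{2,}         # top-level domain, at least two letters
    \b
''', re.VERBOSE)

text = "Contact us at support@appdividend.com or sales@appdividend.co.uk for more information."
print(READABLE_PATTERN.findall(text))
# ['support@appdividend.com', 'sales@appdividend.co.uk']
```

Whitespace inside a character class such as [A-Za-z0-9.-] is still significant in this mode, so line breaks should only be placed between the parts of a pattern, never inside the square brackets.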

Extracting emails from an external file

Let’s extract emails using “complex regex patterns” from a file using the code below:

import re


def complex_regex_email_extraction(text):
    pattern = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
    return re.findall(pattern, text, re.VERBOSE | re.IGNORECASE)


def extract_emails_from_file(file_path):
    try:
        # Reading an external data_source.txt file
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
            emails = complex_regex_email_extraction(content)
        return emails
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return []
    except IOError:
        print(f"Error: Unable to read file '{file_path}'.")
        return []


# Example usage
file_path = 'data_source.txt'
extracted_emails = extract_emails_from_file(file_path)
print(f"Extracted {len(extracted_emails)} valid email(s):")
for email in extracted_emails:
    print(email)

Output

Extracted 22 valid email(s):

sarah.johnson@example.com
mike.brown@example.com
ideas@marketing.example.com
helpdesk@example.com
john.smith@it.example.com
urgent_support@example.com
hr@example.com
jane.doe@hr.example.com
support@example.com
returns@example.com
feedback@customer.example.com
newbusiness@sales.example.com
accounts@sales.example.com
in^valid@example.com
alice.wonder123@gmail.com
bob_smith84@hotmail.com
charlie.brown+work@yahoo.co.uk
newsletter@example.com
subscribers@example.com
contact@supplier-a.com
info@consultants.co
design@freelance.io

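In this section, we reused the data_source.txt file from the previous section and only swapped the simple regex pattern for the complex one. This time invalid@example..com is rejected, while in^valid@example.com is extracted in full because the complex pattern allows “^” in the part before the @.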
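
Because the two file-based scripts differ only in the pattern, one possible refactoring is to pass a compiled pattern to a single reader function. The sketch below also reads the file line by line, which avoids loading a large file into memory at once. The name iter_emails_from_file is an assumption, the simple pattern stands in for whichever pattern you prefer, and the script writes a small temporary file so that it runs on its own.

```python
import os
import re
import tempfile

SIMPLE_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')


def iter_emails_from_file(file_path, pattern):
    """Yield every match of the compiled pattern, reading one line at a time."""
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            yield from pattern.findall(line)


# Example usage with a throwaway file so the sketch runs on its own
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write("Help Desk: helpdesk@example.com\nNo Domain: invalid@\n")
    tmp_path = tmp.name

found = list(iter_emails_from_file(tmp_path, SIMPLE_PATTERN))
os.remove(tmp_path)
print(found)  # ['helpdesk@example.com']
```

Reading line by line is safe here because the simple pattern cannot match across a line break; error handling is left out to keep the sketch short.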

Method 3: Using the “email-validator” library

The “email-validator” library provides a “validate_email()” function that validates the emails extracted from a string or file. Combining regex extraction with a validation step strengthens the extraction process and greatly reduces the chance of keeping invalid addresses.

First, you need to install the “email-validator” library using the command below:

pip install email-validator

Here is the complete code:

import re
from email_validator import validate_email, EmailNotValidError


def complex_regex_email_extraction(text):
    pattern = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
    return re.findall(pattern, text, re.VERBOSE | re.IGNORECASE)


def validate_emails(emails):
    valid_emails = []
    for email in emails:
        try:
            # Validate each email extracted from a file
            validate_email(email)

            # Creating a list of each valid email
            valid_emails.append(email)
        except EmailNotValidError:
            pass
    return valid_emails


def extract_emails_from_file(file_path, validate=False):
    try:
        # Reading an external data_source.txt file
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
            emails = complex_regex_email_extraction(content)
        if validate:
            # Keep only the emails that pass validation
            emails = validate_emails(emails)
        return emails
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return []
    except IOError:
        print(f"Error: Unable to read file '{file_path}'.")
        return []


# Example usage
file_path = 'data_source.txt'
extracted_emails = extract_emails_from_file(file_path, validate=True)
print(f"Extracted {len(extracted_emails)} valid email(s):")
for email in extracted_emails:
    print(email)

Output

Extracted 5 valid email(s):

alice.wonder123@gmail.com
bob_smith84@hotmail.com
charlie.brown+work@yahoo.co.uk
info@consultants.co
design@freelance.io

Wow! Only five emails are valid, but why? The previous section extracted 22 emails, and now there are just five.

The reason is that the email-validator’s validate_email() function is stricter than a regex. It checks the syntax of each address and, by default, also performs a DNS lookup to check whether the domain can receive mail (this check can be turned off by passing check_deliverability=False). In the run above, every address at example.com or one of its subdomains was rejected, as was contact@supplier-a.com, most likely at the deliverability step: example.com is a domain reserved for documentation. Only the addresses at gmail.com, hotmail.com, yahoo.co.uk, consultants.co, and freelance.io passed, so only those are returned.

If you need very strict validation, I highly recommend this library-based approach. However, it is noticeably slower because the validation step adds work for every address, including a network lookup by default.

Krunal Lathiya
