Python

Extracting Email Address from Text in Python

  • 18 Sep, 2024
How to Extract Email Address from Text in Python

Here are three ways to extract email addresses from text in Python:

  1. Using simple regex
  2. Using complex regex
  3. Using the email-validator library

The basic idea is simple: search a string or file for email-like patterns, collect the matches in a list, and print the list or save it to another file.

Method 1: Using simple regex

The re (regular expression) module provides the “re.findall()” function, which accepts a pattern and a text, finds every non-overlapping substring of the text that matches the pattern, and returns the matches as a list.

Of course, we define a pattern for email addresses because that is what we need to extract.

Method 1 - Using simple regex method to extract emails from text

import re


def simple_regex_email_extraction(text):
    # Pattern for email-like strings: local part, "@", domain, and a TLD of 2+ letters
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    # Return every non-overlapping match as a list of strings
    return re.findall(pattern, text)


text = "Contact us at support@appdividend.com or sales@appdividend.co.uk for more information."
print("List of Email Addresses:", simple_regex_email_extraction(text))

Output

List of Email Addresses: ['support@appdividend.com', 'sales@appdividend.co.uk']

The most notable strength of this method is that it is fast and easy to implement. However, it can miss complex email formats or incorrectly match invalid emails.

If you are looking for a quick approach where accuracy isn’t a concern, I recommend using this method.
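To make the trade-off concrete, here is a minimal sketch that runs the simple pattern against three malformed strings. The constant name SIMPLE_PATTERN and the sample strings are chosen for illustration only.

```python
import re

# The simple pattern from this method, with the letter class written as [A-Za-z]
SIMPLE_PATTERN = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

samples = [
    "in^valid@example.com",   # '^' is not in the local-part character class
    "invalid@example..com",   # two consecutive dots in the domain
    "invalid@examplecom",     # no dot before the top-level domain
]

for sample in samples:
    # Print what the pattern extracts from each malformed string
    print(f"{sample!r} -> {re.findall(SIMPLE_PATTERN, sample)}")
```

The first string yields the partial match valid@example.com, the second is accepted in full despite the double dot, and only the third is rejected. This is the kind of inaccuracy to expect from the quick approach.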

Extracting emails from an external file

If your data lives in an external file, you can apply the same function to the file’s content and, if needed, write the extracted emails to a new file.

Assume we have an external text file named data_source.txt with the following content:

Company Contacts and Notes

Marketing Department

Sarah Johnson: sarah.johnson@example.com
Mike Brown: mike.brown@example.com
Contact for campaign ideas: ideas@marketing.example.com


IT Support

Help Desk: helpdesk@example.com
John Smith (Senior Technician): john.smith@it.example.com
For urgent matters: urgent_support@example.com


Human Resources

General Inquiries: hr@example.com
Jane Doe (HR Manager): jane.doe@hr.example.com


Customer Service

General Support: support@example.com
Returns Department: returns@example.com
Feedback: feedback@customer.example.com


Sales Team

New Business: newbusiness@sales.example.com
Account Management: accounts@sales.example.com


Invalid Email Examples (for testing)

No Domain: invalid@
No Username: @invalid.com
Missing Dot: invalid@examplecom
Double Dot: invalid@example..com
Special Characters: in^valid@example.com


Personal Emails (mixed format for testing)

alice.wonder123@gmail.com
bob_smith84@hotmail.com
charlie.brown+work@yahoo.co.uk


Notes

Remember to update the quarterly newsletter subscription:
newsletter@example.com, subscribers@example.com


External Partners

Supplier A: contact@supplier-a.com
Consulting Firm: info@consultants.co
Freelance Designer: design@freelance.io

This file contains various email addresses.

You can now use this file to test the email extraction function, as in the code below:

import re


def simple_regex_email_extraction(text):
    # Pattern for email-like strings: local part, "@", domain, and a TLD of 2+ letters
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    # Return every non-overlapping match as a list of strings
    return re.findall(pattern, text)


def extract_emails_from_file(file_path):
    try:
        # Reading an external data_source.txt file
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
            emails = simple_regex_email_extraction(content)
        return emails
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return []
    except IOError:
        print(f"Error: Unable to read file '{file_path}'.")
        return []


# Example usage
file_path = 'data_source.txt'
extracted_emails = extract_emails_from_file(file_path)
print(f"Extracted {len(extracted_emails)} valid email(s):")
for email in extracted_emails:
    print(email)

Output

Extracted 23 valid email(s):

sarah.johnson@example.com
mike.brown@example.com
ideas@marketing.example.com
helpdesk@example.com
john.smith@it.example.com
urgent_support@example.com
hr@example.com
jane.doe@hr.example.com
support@example.com
returns@example.com
feedback@customer.example.com
newbusiness@sales.example.com
accounts@sales.example.com
invalid@example..com
valid@example.com
alice.wonder123@gmail.com
bob_smith84@hotmail.com
charlie.brown+work@yahoo.co.uk
newsletter@example.com
subscribers@example.com
contact@supplier-a.com
info@consultants.co
design@freelance.io

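In this code, we read an external file, find every substring that matches the email pattern, collect the matches in a list, and print them to the console. Note that the list contains two false positives from the “Invalid Email Examples” section: invalid@example..com (a double dot in the domain) and valid@example.com, which is only the tail of in^valid@example.com because “^” is not in the pattern’s character class.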
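The code above only prints the result. To save the emails to a new file instead, something like the following sketch could be used. The helper name save_emails and the output file name extracted_emails.txt are assumptions for illustration, not part of the original code.

```python
from pathlib import Path


def save_emails(emails, output_path):
    """Write unique emails to output_path, one per line, in first-seen order."""
    # dict.fromkeys drops duplicates while preserving insertion order
    unique_emails = list(dict.fromkeys(emails))
    # One address per line; an empty list produces an empty file
    content = "".join(f"{email}\n" for email in unique_emails)
    Path(output_path).write_text(content, encoding="utf-8")
    return unique_emails


# Example usage with a short hard-coded list (hypothetical output file name)
saved = save_emails(
    ["hr@example.com", "support@example.com", "hr@example.com"],
    "extracted_emails.txt",
)
print(f"Saved {len(saved)} unique email(s)")
```

In practice you would pass the list returned by extract_emails_from_file() rather than a hard-coded list. Removing duplicates is optional, but it is usually wanted when the same address appears several times in the source.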

Method 2: Using complex regex

This method uses a much more comprehensive regular expression (regex) pattern that covers a broader range of email formats, such as quoted local parts and bracketed IP-address domains.

It is helpful when you need to extract emails written in varied formats and want higher accuracy than the simple pattern offers.

Method 2 - Using complex regex method to extract emails from text

import re


def complex_regex_extraction(text):
    # Kept on one line: with re.VERBOSE, whitespace inside a character class is still significant
    pattern = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
    return re.findall(pattern, text, re.VERBOSE | re.IGNORECASE)


text = "Contact us at krunal@appdividend.com or sales@appdividend.com.au for more information."
print("Complex Regex:", complex_regex_extraction(text))

Output

Complex Regex: ['krunal@appdividend.com', 'sales@appdividend.com.au']

For a simple use case, this approach might be overkill, and the pattern is hard to read and maintain without solid knowledge of regular expressions. However, it catches most valid email formats.
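One way to soften the readability problem is to use the re.VERBOSE flag, which lets you spread a pattern over several lines and comment each part. The sketch below is a simplified pattern covering only the common unquoted form (dot-separated atoms, “@”, dot-separated domain labels). It is not a replacement for the full expression above, and the name READABLE_PATTERN is an assumption for illustration.

```python
import re

# Simplified, commented pattern: unquoted local part and a dotted domain only.
# Inside a character class, '#' and spaces are literal even with re.VERBOSE.
READABLE_PATTERN = re.compile(
    r"""
    [a-z0-9!#$%&'*+/=?^_`{|}~-]+            # first atom of the local part
    (?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*     # further atoms, each after a single dot
    @                                       # separator
    (?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+  # one or more domain labels, each ending in a dot
    [a-z0-9](?:[a-z0-9-]*[a-z0-9])?         # final label, for example com or uk
    """,
    re.VERBOSE | re.IGNORECASE,
)

text = "Contact us at krunal@appdividend.com or sales@appdividend.com.au for more information."
print(READABLE_PATTERN.findall(text))
```

Every group in the pattern is non-capturing, written as (?:...), so findall() returns whole matches rather than group contents.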

Extracting emails from an external file

Let’s extract emails using “complex regex patterns” from a file using the code below:

import re


def complex_regex_email_extraction(text):
    pattern = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
    return re.findall(pattern, text, re.VERBOSE | re.IGNORECASE)


def extract_emails_from_file(file_path):
    try:
        # Reading an external data_source.txt file
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
            emails = complex_regex_email_extraction(content)
        return emails
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return []
    except IOError:
        print(f"Error: Unable to read file '{file_path}'.")
        return []


# Example usage
file_path = 'data_source.txt'
extracted_emails = extract_emails_from_file(file_path)
print(f"Extracted {len(extracted_emails)} valid email(s):")
for email in extracted_emails:
    print(email)

Output

Extracted 22 valid email(s):

sarah.johnson@example.com
mike.brown@example.com
ideas@marketing.example.com
helpdesk@example.com
john.smith@it.example.com
urgent_support@example.com
hr@example.com
jane.doe@hr.example.com
support@example.com
returns@example.com
feedback@customer.example.com
newbusiness@sales.example.com
accounts@sales.example.com
in^valid@example.com
alice.wonder123@gmail.com
bob_smith84@hotmail.com
charlie.brown+work@yahoo.co.uk
newsletter@example.com
subscribers@example.com
contact@supplier-a.com
info@consultants.co
design@freelance.io

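In this section, we reused the source file from the previous section and swapped the simple pattern for the more complex one. The result differs in two places: invalid@example..com is no longer matched, and in^valid@example.com is now matched in full, because “^” is a permitted character in the local part under RFC 5322 even though the sample file lists it as invalid.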

Method 3: Using the “email-validator” library

The “email-validator” library provides a “validate_email()” function that will validate the input emails extracted from a string or file.

It combines regular expression extraction with a validation step, which filters out matches that only look like email addresses and makes the extracted list more reliable.

First, you need to install the “email-validator” library using the command below:

pip install email-validator

Here is the complete code:

import re
from email_validator import validate_email, EmailNotValidError


def complex_regex_email_extraction(text):
    pattern = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
    return re.findall(pattern, text, re.VERBOSE | re.IGNORECASE)


def validate_emails(emails):
    valid_emails = []
    for email in emails:
        try:
            # Validate each email extracted from a file
            validate_email(email)

            # Creating a list of each valid email
            valid_emails.append(email)
        except EmailNotValidError:
            pass
    return valid_emails


def extract_emails_from_file(file_path, validate=False):
    try:
        # Reading an external data_source.txt file
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
            emails = complex_regex_email_extraction(content)
        if validate:
            # Keep only the addresses that pass email-validator's checks
            return validate_emails(emails)
        return emails
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return []
    except IOError:
        print(f"Error: Unable to read file '{file_path}'.")
        return []


# Example usage
file_path = 'data_source.txt'
extracted_emails = extract_emails_from_file(file_path, validate=True)
print(f"Extracted {len(extracted_emails)} valid email(s):")
for email in extracted_emails:
    print(email)

Output

Extracted 5 valid email(s):

alice.wonder123@gmail.com
bob_smith84@hotmail.com
charlie.brown+work@yahoo.co.uk
info@consultants.co
design@freelance.io

Only five emails pass validation, even though the previous section found 22. Why?

By default, the validate_email() function does more than check the syntax of each address against the RFC rules. It also runs a deliverability check (the check_deliverability parameter, which defaults to True) that queries DNS to see whether the domain can receive mail.

The example.com domain is reserved for documentation, and the addresses on it and its subdomains fail that check, so they are dropped.

That’s why only the addresses on “gmail.com”, “hotmail.com”, “yahoo.co.uk”, “consultants.co”, and “freelance.io” are returned: when this code ran, those were the domains that passed the DNS check. If you only need syntax validation, pass check_deliverability=False to skip the DNS lookup.

If you need rigorous validation, I recommend this library-based approach. It is slower than the regex-only methods because each address goes through an extra validation step, which by default includes a DNS lookup and therefore needs network access.
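When many addresses are rejected, it helps to see the reason for each one. The sketch below is one possible way to do that. The helper names split_by_validity and syntax_only_validator are hypothetical. It assumes an installed email-validator release that accepts the check_deliverability keyword, and it relies on EmailNotValidError being a subclass of ValueError.

```python
def syntax_only_validator(email):
    """Validate syntax with email-validator, skipping the DNS lookup."""
    # Imported lazily so split_by_validity() works without the package installed
    from email_validator import validate_email
    validate_email(email, check_deliverability=False)


def split_by_validity(emails, validator):
    """Return (accepted, rejected), where rejected maps each email to a reason."""
    accepted, rejected = [], {}
    for email in emails:
        try:
            validator(email)
            accepted.append(email)
        except ValueError as error:
            # EmailNotValidError subclasses ValueError, so it is caught here
            rejected[email] = str(error)
    return accepted, rejected
```

Calling split_by_validity(extracted_emails, syntax_only_validator) would then return the accepted addresses together with a dictionary of rejected addresses and the library’s error messages, which makes it easier to tell a syntax problem from a failed DNS check.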

Krunal Lathiya

With a career spanning over eight years in the field of Computer Science, Krunal’s expertise is rooted in a solid foundation of hands-on experience, complemented by a continuous pursuit of knowledge.


Address: TwinStar, South Block – 1202, 150 Ft Ring Road, Nr. Nana Mauva Circle, Rajkot(360005), Gujarat, India

Call: (+91) 9409548155

Email: support@appdividend.com

Copyright @2024 AppDividend. All Rights Reserved