Whether you are creating a contact list or compiling lists for newsletter distributions, you need to extract email addresses from either text or external data sources like a file.
The basic idea is simple: search a string or file for email-like patterns, collect the matches in a list, and print the list or save it to another file.
Here are three ways to extract email addresses from text in Python:
- Using simple regex
- Using complex regex
- Using the email-validator library
Method 1: Using simple regex
The re (regular expression) module provides a “re.findall()” function that accepts a pattern and a text and returns a list of all non-overlapping substrings of the text that match the pattern. Here, the pattern describes an email address, because that is what we need to extract.
    import re


    def simple_regex_email_extraction(text):
        # Pattern for a basic address: local part, "@", domain, and a TLD of 2+ letters
        pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        # Return every non-overlapping match as a list of strings
        return re.findall(pattern, text)


    text = "Contact us at support@appdividend.com or sales@appdividend.co.uk for more information."
    print("List of Email Addresses:", simple_regex_email_extraction(text))
Output
List of Email Addresses: ['support@appdividend.com', 'sales@appdividend.co.uk']
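One detail of re.findall() is worth keeping in mind when adapting the pattern: if the pattern contains capturing groups, the function returns the captured groups instead of the whole match. The following sketch uses a simplified pattern chosen only for illustration.

```python
import re

text = "Contact us at support@appdividend.com or sales@appdividend.co.uk for more information."

# No capturing groups: each list item is the full matched address
whole_matches = re.findall(r'[\w.+-]+@[\w.-]+\.[A-Za-z]{2,}', text)

# Two capturing groups: each list item is a (local part, domain) tuple
split_matches = re.findall(r'([\w.+-]+)@([\w.-]+\.[A-Za-z]{2,})', text)

print(whole_matches)
print(split_matches)
```

To group parts of a pattern without changing what re.findall() returns, use non-capturing groups, written (?:...).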
The most notable strength of this method is that it is fast and easy to implement. However, it can miss complex email formats or incorrectly match invalid emails.
If you need a quick approach and accuracy is not critical, I recommend this method.
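To make these limitations concrete, the sketch below runs a variant of the simple pattern (with the final character class written as [A-Za-z]) against three malformed addresses. The sample strings are illustrative assumptions, not data from a real source.

```python
import re

# Variant of the simple pattern, used here only to illustrate its limits
SIMPLE_PATTERN = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

samples = [
    "invalid@example..com",  # consecutive dots in the domain
    "in^valid@example.com",  # "^" is not part of the local-part character class
    "invalid@examplecom",    # no dot in the domain
]

# Map each sample to whatever the pattern extracts from it
results = {sample: re.findall(SIMPLE_PATTERN, sample) for sample in samples}

for sample, found in results.items():
    print(f"{sample!r:24} -> {found}")
```

The first sample is accepted even though it is malformed, the second is only partially matched as valid@example.com, and the third is correctly rejected.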
Extracting emails from an external file
If you are working with an external data source file and want to extract the emails from it (for example, to build a new file of addresses), you can use the same approach.
Assume that we have an external text file named data_source.txt with the following content:
    Company Contacts and Notes

    Marketing Department
    Sarah Johnson: sarah.johnson@example.com
    Mike Brown: mike.brown@example.com
    Contact for campaign ideas: ideas@marketing.example.com

    IT Support
    Help Desk: helpdesk@example.com
    John Smith (Senior Technician): john.smith@it.example.com
    For urgent matters: urgent_support@example.com

    Human Resources
    General Inquiries: hr@example.com
    Jane Doe (HR Manager): jane.doe@hr.example.com

    Customer Service
    General Support: support@example.com
    Returns Department: returns@example.com
    Feedback: feedback@customer.example.com

    Sales Team
    New Business: newbusiness@sales.example.com
    Account Management: accounts@sales.example.com

    Invalid Email Examples (for testing)
    No Domain: invalid@
    No Username: @invalid.com
    Missing Dot: invalid@examplecom
    Double Dot: invalid@example..com
    Special Characters: in^valid@example.com

    Personal Emails (mixed format for testing)
    alice.wonder123@gmail.com
    bob_smith84@hotmail.com
    charlie.brown+work@yahoo.co.uk

    Notes
    Remember to update the quarterly newsletter subscription: newsletter@example.com, subscribers@example.com

    External Partners
    Supplier A: contact@supplier-a.com
    Consulting Firm: info@consultants.co
    Freelance Designer: design@freelance.io
This file contains a mix of valid and invalid email addresses, so you can use it to test the email extraction function with the code below:
    import re


    def simple_regex_email_extraction(text):
        # Pattern for a basic address: local part, "@", domain, and a TLD of 2+ letters
        pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        # Return every non-overlapping match as a list of strings
        return re.findall(pattern, text)


    def extract_emails_from_file(file_path):
        try:
            # Read the whole data_source.txt file into memory
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
            return simple_regex_email_extraction(content)
        except FileNotFoundError:
            print(f"Error: File '{file_path}' not found.")
            return []
        except OSError:
            print(f"Error: Unable to read file '{file_path}'.")
            return []


    # Example usage
    file_path = 'data_source.txt'
    extracted_emails = extract_emails_from_file(file_path)
    print(f"Extracted {len(extracted_emails)} valid email(s):")
    for email in extracted_emails:
        print(email)
Output
    Extracted 23 valid email(s):
    sarah.johnson@example.com
    mike.brown@example.com
    ideas@marketing.example.com
    helpdesk@example.com
    john.smith@it.example.com
    urgent_support@example.com
    hr@example.com
    jane.doe@hr.example.com
    support@example.com
    returns@example.com
    feedback@customer.example.com
    newbusiness@sales.example.com
    accounts@sales.example.com
    invalid@example..com
    valid@example.com
    alice.wonder123@gmail.com
    bob_smith84@hotmail.com
    charlie.brown+work@yahoo.co.uk
    newsletter@example.com
    subscribers@example.com
    contact@supplier-a.com
    info@consultants.co
    design@freelance.io
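In this code, we read an external file, find every substring that matches the email pattern, collect the matches in a list, and print them on the console. Note that two of the deliberately invalid entries slip through: invalid@example..com is matched as is, and in^valid@example.com is partially matched as valid@example.com.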
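The extracted list can also be saved to a new file, as mentioned at the start of this article. The following is a minimal sketch: save_emails and the output file name extracted_emails.txt are hypothetical names chosen for illustration, and exact duplicates are dropped while keeping first-seen order.

```python
from pathlib import Path


def save_emails(emails, output_path):
    """Write unique addresses to output_path, one per line, in first-seen order."""
    # dict.fromkeys() removes exact duplicates while preserving insertion order
    unique_emails = list(dict.fromkeys(emails))
    Path(output_path).write_text(
        "".join(f"{email}\n" for email in unique_emails), encoding="utf-8"
    )
    return unique_emails


if __name__ == "__main__":
    # Stand-in for the list returned by an extraction function
    sample = ["hr@example.com", "support@example.com", "hr@example.com"]
    saved = save_emails(sample, "extracted_emails.txt")
    print(f"Saved {len(saved)} unique email(s).")
```

Returning the deduplicated list lets the caller report how many addresses were actually written.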
Method 2: Using complex regex
As the name suggests, this method uses a comprehensive regex pattern that covers a broader range of email formats. It is helpful when you need to extract emails that appear in various formats and you want higher accuracy.
    import re


    def complex_regex_extraction(text):
        # Comprehensive pattern: dot-separated or quoted local part, then a domain name or a bracketed IP literal
        pattern = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
        # All groups are non-capturing, so findall() returns whole addresses
        return re.findall(pattern, text, re.VERBOSE | re.IGNORECASE)


    text = "Contact us at krunal@appdividend.com or sales@appdividend.com.au for more information."
    print("Complex Regex:", complex_regex_extraction(text))
Output
Complex Regex: ['krunal@appdividend.com', 'sales@appdividend.com.au']
For a simple use case, this approach might be overkill, and it requires extensive knowledge of regular expression patterns. However, it catches most valid email formats.
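Because the code above passes re.VERBOSE, whitespace outside character classes is ignored and “#” outside a character class starts a comment, so a long pattern can be split across lines and annotated. The sketch below lays out only the unquoted local part and the domain-name branch of the pattern, as a simplified illustration rather than a replacement for it. READABLE_PATTERN is a hypothetical name.

```python
import re

# Simplified sketch: omits the quoted local part and the bracketed IP literal
READABLE_PATTERN = re.compile(
    r"""
    [a-z0-9!#$%&'*+/=?^_`{|}~-]+                   # first atom of the local part
    (?: \. [a-z0-9!#$%&'*+/=?^_`{|}~-]+ )*         # further dot-separated atoms
    @
    (?: [a-z0-9] (?: [a-z0-9-]* [a-z0-9] )? \. )+  # one or more labels, each followed by a dot
    [a-z0-9] (?: [a-z0-9-]* [a-z0-9] )?            # final label
    """,
    re.VERBOSE | re.IGNORECASE,
)

sample_text = "Reach in^valid@example.com or invalid@example..com"
matches = READABLE_PATTERN.findall(sample_text)
print(matches)
```

Compiling the pattern once with re.compile() also avoids repeating the flags at every call site.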
Extracting emails from an external file
Let’s extract emails using “complex regex patterns” from a file using the code below:
    import re


    def complex_regex_email_extraction(text):
        pattern = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
        return re.findall(pattern, text, re.VERBOSE | re.IGNORECASE)


    def extract_emails_from_file(file_path):
        try:
            # Read the whole data_source.txt file into memory
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
            return complex_regex_email_extraction(content)
        except FileNotFoundError:
            print(f"Error: File '{file_path}' not found.")
            return []
        except OSError:
            print(f"Error: Unable to read file '{file_path}'.")
            return []


    # Example usage
    file_path = 'data_source.txt'
    extracted_emails = extract_emails_from_file(file_path)
    print(f"Extracted {len(extracted_emails)} valid email(s):")
    for email in extracted_emails:
        print(email)
Output
    Extracted 22 valid email(s):
    sarah.johnson@example.com
    mike.brown@example.com
    ideas@marketing.example.com
    helpdesk@example.com
    john.smith@it.example.com
    urgent_support@example.com
    hr@example.com
    jane.doe@hr.example.com
    support@example.com
    returns@example.com
    feedback@customer.example.com
    newbusiness@sales.example.com
    accounts@sales.example.com
    in^valid@example.com
    alice.wonder123@gmail.com
    bob_smith84@hotmail.com
    charlie.brown+work@yahoo.co.uk
    newsletter@example.com
    subscribers@example.com
    contact@supplier-a.com
    info@consultants.co
    design@freelance.io
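In this section, we reused the data_source.txt file from the previous section and replaced the simple regex pattern with the complex one. This time invalid@example..com is rejected, while in^valid@example.com is matched in full because the pattern allows “^” in the local part of an address.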
Method 3: Using the “email-validator” library
The “email-validator” library provides a “validate_email()” function that validates the emails extracted from a string or file. Combining regex extraction with this validation step strengthens the extraction process and greatly reduces the chance of inaccurate data.
First, you need to install the “email-validator” library using the command below:
pip install email-validator
Here is the complete code:
    import re

    from email_validator import validate_email, EmailNotValidError


    def complex_regex_email_extraction(text):
        pattern = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
        return re.findall(pattern, text, re.VERBOSE | re.IGNORECASE)


    def validate_emails(emails):
        valid_emails = []
        for email in emails:
            try:
                # Validate each email extracted from the file
                validate_email(email)
                # Keep only the emails that pass validation
                valid_emails.append(email)
            except EmailNotValidError:
                pass
        return valid_emails


    def extract_emails_from_file(file_path, validate=False):
        try:
            # Read the whole data_source.txt file into memory
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
            emails = complex_regex_email_extraction(content)
            # Validate only when requested; otherwise return the raw regex matches
            if validate:
                return validate_emails(emails)
            return emails
        except FileNotFoundError:
            print(f"Error: File '{file_path}' not found.")
            return []
        except OSError:
            print(f"Error: Unable to read file '{file_path}'.")
            return []


    # Example usage
    file_path = 'data_source.txt'
    extracted_emails = extract_emails_from_file(file_path, validate=True)
    print(f"Extracted {len(extracted_emails)} valid email(s):")
    for email in extracted_emails:
        print(email)
Output
    Extracted 5 valid email(s):
    alice.wonder123@gmail.com
    bob_smith84@hotmail.com
    charlie.brown+work@yahoo.co.uk
    info@consultants.co
    design@freelance.io
Wow! Only five emails are valid, but why? The previous section returned 22 emails, and now there are just five. Can you guess what the reason could be? Well, the reason is simple.
The email-validator’s validate_email() function checks more than syntax. By default, it also checks deliverability by performing a DNS lookup to confirm that the domain is set up to receive email. In this run, the addresses at example.com and its subdomains, as well as the one at supplier-a.com, failed that check and were rejected. Only the addresses at gmail.com, hotmail.com, yahoo.co.uk, consultants.co, and freelance.io passed, so only those five are returned. Because the check depends on live DNS data, the result can change over time.
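If a network lookup is not wanted, validation can be limited to syntax. The sketch below assumes an email-validator release in which validate_email() accepts the check_deliverability keyword. syntax_valid_emails is a hypothetical helper name, and the import sits inside the function only so that the snippet can be loaded on machines where the package is not installed.

```python
def syntax_valid_emails(emails):
    """Keep addresses that pass email-validator's syntax checks, without DNS lookups."""
    # Lazy import (an assumption of this sketch): requires `pip install email-validator`
    from email_validator import validate_email, EmailNotValidError

    valid_emails = []
    for email in emails:
        try:
            # check_deliverability=False skips the DNS query, so only syntax is checked
            validate_email(email, check_deliverability=False)
            valid_emails.append(email)
        except EmailNotValidError:
            # Syntactically invalid address: leave it out
            pass
    return valid_emails


if __name__ == "__main__":
    candidates = ["alice.wonder123@gmail.com", "invalid@gmail..com"]
    print(syntax_valid_emails(candidates))
```

With the lookup disabled, an address is judged only on its form, so a well-formed address at a domain that cannot receive mail would be expected to pass.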
If you need very strict validation, I highly recommend this library approach. It is slower than the regex-only methods because it adds a validation step, but it gets the job done.