These days, if you don't use emojis in your text, you might get called a “grandpa!” But emojis can be a nuisance when it comes to analyzing the data.
Traditional NLP models are typically not trained on or optimized for emojis. To train them and feed them clean data, we often need to remove emojis from the text or string first.
The text in the image above contains two types of Unicode characters:
- Normal string (Data)
- Emojis (✍ 🌷 📌 👈🏻 🖥 😀 😃 😄 😁 😆 😅 😂)
After the emojis are removed, only the regular string is left.
The main goal of this tutorial is to clean the text that contains emojis programmatically.
Here are three ways to remove emojis from the text in Python:
- Using the “re” module
- Using the “emoji” package
- Using the “clean-text” package (imported as “cleantext”)
Method 1: Using “re” module
When it comes to finding and replacing specific patterns, the built-in “re” module is hard to beat. The first step is to create an emoji_pattern object with the re.compile() method. Then we call emoji_pattern.sub(r'', text) to replace every match of the emoji pattern in the string with an empty string (''), effectively removing it.
If you are looking for customization and performance, go for this approach: you decide exactly which Unicode ranges get removed, the pattern below covers a wide range of emoji code points, and it is relatively fast for most use cases.
    import re

    # Custom function that uses re.compile() and .sub() to replace
    # emoji-type patterns with an empty string
    def remove_emoji(txt):
        emoji_pattern = re.compile(
            "["
            u"\U0001F600-\U0001F64F"  # emoticons
            u"\U0001F300-\U0001F5FF"  # symbols & pictographs
            u"\U0001F680-\U0001F6FF"  # transport & map symbols
            u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
            u"\U00002702-\U000027B0"  # dingbats
            u"\U000024C2-\U0001F251"  # broad catch-all; also matches many non-emoji characters
            "]+",
            flags=re.UNICODE,
        )
        return emoji_pattern.sub(r'', txt)

    txt = "Data ✍ 🌷 📌 👈🏻 🖥 😀 😃 😄 😁 😆 😅 😂"

    print("Before removing emojis")
    print(txt)
    print("After removing emojis")
    print(remove_emoji(txt))
Output
    Before removing emojis
    Data ✍ 🌷 📌 👈🏻 🖥 😀 😃 😄 😁 😆 😅 😂
    After removing emojis
    Data
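Two things are worth knowing about the pattern above. Its last range, \U000024C2-\U0001F251, spans far more than emojis (CJK ideographs fall inside it, for example), so it can delete ordinary non-English text. The removal also leaves behind the spaces that used to separate the emojis. Here is a minimal sketch of one way to handle both, assuming a narrower set of ranges is acceptable for your data. The names NARROW_EMOJI_PATTERN and remove_emoji_narrow are hypothetical, and the ranges are an illustrative subset that will miss some emojis.

```python
import re

# Assumption: an illustrative subset of emoji ranges. It drops the broad
# \U000024C2-\U0001F251 range and is NOT a complete list of emoji code points.
NARROW_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs (includes skin tones)
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\U00002702-\U000027B0"  # dingbats
    "\U0000FE0F"             # variation selector-16 (emoji presentation)
    "]+"
)

def remove_emoji_narrow(txt):
    """Remove emojis matched by the narrow pattern, then tidy whitespace."""
    without_emoji = NARROW_EMOJI_PATTERN.sub("", txt)
    # Collapse runs of whitespace (including newlines) left behind
    # by the removed emojis into single spaces.
    return re.sub(r"\s+", " ", without_emoji).strip()

print(remove_emoji_narrow("Data ✍ 🌷 📌 😀 😂"))  # Data
print(remove_emoji_narrow("价格 100 😀"))         # 价格 100
```

The second call shows the trade-off: the Chinese characters survive because the broad range is gone, at the cost of having to add more ranges yourself whenever an emoji slips through.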
Method 2: Using “emoji” module
The “emoji” library provides the emoji.replace_emoji() function, which finds the emojis in a string and replaces them with whatever you pass in; here, that is an empty string.
First, you need to install the “emoji” library using the command below:
pip install emoji
Now, you can import it and use its replace_emoji() function.
    import emoji

    # Custom function that accepts "txt" and passes it to
    # emoji.replace_emoji() to remove emojis from the string
    def remove_using_emoji(txt):
        return emoji.replace_emoji(txt, '')

    txt = "Data ✍ 🌷 📌 👈🏻 🖥 😀 😃 😄 😁 😆 😅 😂"

    print("Before removing emojis")
    print(txt)
    print("After removing emojis")
    print(remove_using_emoji(txt))
Output
    Before removing emojis
    Data ✍ 🌷 📌 👈🏻 🖥 😀 😃 😄 😁 😆 😅 😂
    After removing emojis
    Data

This approach requires you to install an external library. However, if your aim is a simple solution, I highly recommend it. Furthermore, the maintainers regularly update the library to include newly released emojis, so it tends to cover more cases than a hand-written pattern.
Method 3: Using “clean-text” module
Ahh, another third-party module called “clean-text” handles emojis very well! If you are looking for not only emoji removal but also a comprehensive text-cleaning solution, then I would advise you to use this approach. Install it with pip install clean-text and note that it is imported as “cleantext”.
    from cleantext import clean

    # Custom function that accepts "txt" and passes it to clean()
    # to remove email addresses, digits, and emojis
    def remove_using_cleantext(txt):
        cleaned_data = clean(
            txt,
            no_emails=True,         # Remove email addresses
            no_digits=True,         # Remove digits
            no_emoji=True,          # Remove emojis
            replace_with_email="",  # Replace emails with an empty string
            replace_with_digit="",  # Replace digits with an empty string
        )
        return cleaned_data

    txt = "Contact me at krunal@appdividend.com or call 444 555 9999 📞. Have a great day! 😊"

    print("Before removing email address, emojis, and digits")
    print(txt)
    print("After removing email address, emojis, and digits")
    print(remove_using_cleantext(txt))
Output
    Before removing email address, emojis, and digits
    Contact me at krunal@appdividend.com or call 444 555 9999 📞. Have a great day! 😊
    After removing email address, emojis, and digits
    contact me at or call . have a great day!

As you can see from the code, we called the cleantext.clean() function and passed it “txt” along with no_emoji=True, no_emails=True, and no_digits=True to remove emojis, emails, and digits from the input text. The replace_with_email="" and replace_with_digit="" arguments swap the default replacement tokens for emails and digits with an empty string, so the final output won’t include any of these things. The output is also lowercased because clean() lowercases text by default.
This solution is simple. However, if you only need to remove one specific thing, I would not recommend this approach, because there are more targeted packages available for that.
You should only use this method when you want to remove a combination of data, such as “emojis, emails, and digits,” “emojis and emails,” “emojis and digits,” or any other combination you find plausible.
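To make the idea of combined cleaning concrete, here is a rough standard-library sketch that performs the same three removals in one function. It is not how clean-text is implemented. The patterns, the emoji ranges, and the name strip_emails_digits_emojis are all assumptions made for illustration, and the email pattern is deliberately simplistic.

```python
import re

# Assumption: a deliberately simple email pattern, not a full address validator.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DIGIT_PATTERN = re.compile(r"\d")
# Assumption: an illustrative subset of emoji ranges, not a complete list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U00002702-\U000027B0"  # dingbats
    "\U0000FE0F"             # variation selector-16
    "]+"
)

def strip_emails_digits_emojis(txt):
    """Remove emails, emojis, and digits, then collapse leftover whitespace."""
    # Emails go first so that digits inside an address are removed with it.
    for pattern in (EMAIL_PATTERN, EMOJI_PATTERN, DIGIT_PATTERN):
        txt = pattern.sub("", txt)
    return re.sub(r"\s+", " ", txt).strip()

sample = "Contact me at krunal@appdividend.com or call 444 555 9999 📞. Have a great day! 😊"
print(strip_emails_digits_emojis(sample))
# Contact me at or call . Have a great day!
```

Unlike clean(), this sketch leaves the letter case alone. Once you find yourself maintaining several patterns like these, a package such as clean-text starts to pay for itself.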
That’s all I needed to address for this tutorial. I hope all of you programmers have a nice day.