How to extract all the emojis from text?
Solution 1
You can use the emoji library. You can check whether a single codepoint is an emoji codepoint by checking whether it is contained in emoji.UNICODE_EMOJI.
import emoji
def extract_emojis(s):
    # Keep only the characters that are themselves emoji codepoints.
    return ''.join(c for c in s if c in emoji.UNICODE_EMOJI['en'])
Solution 2
I think it's important to point out that the previous answers won't work with emojis like 👨‍👩‍👦‍👦, because it consists of 4 emojis (joined by zero-width joiners), and using ... in emoji.UNICODE_EMOJI will return 4 different emojis. The same goes for emojis with skin color like 🙅🏽.
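To see why a per-codepoint check splits such emojis, it helps to look at the codepoints themselves. The following is a minimal sketch using only the standard library; the names `family`, `toned`, and `codepoints` are illustrative and do not come from the answers here.

```python
# The family emoji is one visible glyph but seven codepoints:
# man, ZWJ, woman, ZWJ, boy, ZWJ, boy.
ZWJ = "\u200d"  # zero-width joiner
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F466" + ZWJ + "\U0001F466"

# A skin-toned emoji is a base emoji followed by a modifier codepoint.
toned = "\U0001F645\U0001F3FD"


def codepoints(s):
    """Return the codepoints of s as U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in s]


print(len(family), codepoints(family))
print(len(toned), codepoints(toned))
```

Iterating over `family` character by character therefore visits four separate emoji codepoints plus three joiners, which is why a grapheme-aware approach is needed to keep the sequence together.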
My solution
Include the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character), so we can count emojis like 👨‍👩‍👦‍👦 as one emoji.
import emoji
import regex
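def split_count(text):
    emoji_list = []
    # \X matches one grapheme cluster (one user-perceived character).
    data = regex.findall(r'\X', text)
    for word in data:
        # Keep the whole cluster if any of its codepoints is an emoji.
        if any(char in emoji.UNICODE_EMOJI['en'] for char in word):
            emoji_list.append(word)
    return emoji_list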
Testing
with more emojis with skin color:
line = ["🤔 🙈 me así, se 😌 ds 💕👭👙 hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦 how are 😊 you today🙅🏽🙅🏽"]
counter = split_count(line[0])
print(' '.join(emoji for emoji in counter))
output:
🤔 🙈 😌 💕 👭 👙 👩🏾‍🎓 👨‍👩‍👦‍👦 😊 🙅🏽 🙅🏽
Include flags
If you want to include flags, like 🇵🇰, the Unicode range would be from 🇦 to 🇿, so add:
flags = regex.findall(u'[\U0001F1E6-\U0001F1FF]', text)
to the function above, and return emoji_list + flags.
See this answer to "A python regex that matches the regional indicator character class" for more information about the flags.
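Note that a single-character class such as the one above matches each regional indicator on its own, so one flag comes back as two items. A minimal sketch of matching them in pairs with the standard-library re module follows; `FLAG_PATTERN` and `find_flags` are hypothetical names, and the pattern does not check that a pair is a real country code.

```python
import re

# Two consecutive regional indicator symbols (U+1F1E6..U+1F1FF) form one flag.
FLAG_PATTERN = re.compile("[\U0001F1E6-\U0001F1FF]{2}")


def find_flags(text):
    """Return every regional-indicator pair found in text."""
    return FLAG_PATTERN.findall(text)


sample = "Hello \U0001F1F7\U0001F1FA hello \U0001F1F5\U0001F1F0"
print(find_flags(sample))
```

Pairs are taken greedily from left to right, so several flags written with no separator between them are still split at the right boundaries as long as the run starts on a flag boundary.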
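For newer emoji versions
To work with emoji >= v1.2.0 you have to add a language specifier (e.g. en, as in the code above):
emoji.UNICODE_EMOJI['en']
Note that emoji 2.0.0 removed UNICODE_EMOJI altogether; in those versions emoji.EMOJI_DATA and emoji.is_emoji() cover the same check.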
Solution 3
If you don't want to use an external library, a Pythonic alternative is to use regular expressions: re.findall() with a suitable regex finds the emojis:
In [74]: import re
In [75]: re.findall(r'[^\w\s,]', a_list[0])
Out[75]: ['🤔', '🙈', '😌', '💕', '👭', '👙']
The regular expression r'[^\w\s,]' is a negated character class that matches any character that is not a word character, whitespace, or a comma.
As I mentioned in a comment, text generally contains word characters and punctuation, which this approach handles easily; for other cases you can add the characters to the character class manually. Note that since a character class can contain ranges of characters, you can make it shorter and more flexible.
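The limitation raised in the comments is easy to reproduce. This small sketch (the sample text is made up for illustration) shows the negated class picking up ordinary punctuation along with the emoji, and how extending the class by hand changes the result:

```python
import re

text = "Wait... really?! \U0001F914"

# Original class: everything that is not a word character, whitespace, or comma.
basic = re.compile(r"[^\w\s,]")
print(basic.findall(text))

# Same idea with the punctuation of this particular text excluded by hand.
extended = re.compile(r"[^\w\s,.?!]")
print(extended.findall(text))
```

The second pattern only works because the punctuation in this text is known in advance, which is the trade-off discussed in the comments.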
Another solution, instead of a negated character class that excludes the non-emoji characters, is a character class that accepts emojis ([] without ^). Since there are a lot of emojis with different Unicode values, you just need to add their ranges to the character class. If you want to match more emojis, here is a good reference that lists the standard emojis with their respective ranges: http://apps.timwhitlock.info/emoji/tables/unicode
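A positive character class of this kind might look like the sketch below. The selection of Unicode blocks is an assumption chosen to cover the sample string, not a complete list of emoji ranges, and `EMOJI_RANGES` is a hypothetical name.

```python
import re

# A few emoji-heavy Unicode blocks; this list is illustrative, not exhaustive.
EMOJI_RANGES = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\u2600-\u26FF"          # Miscellaneous Symbols
    "\u2700-\u27BF"          # Dingbats
    "]"
)

# The list from the question, written with escapes.
a_list = ["\U0001F914 \U0001F648 me as\u00ed, bla es se \U0001F60C ds \U0001F495\U0001F46D\U0001F459"]
print(EMOJI_RANGES.findall(a_list[0]))
```

Like the per-codepoint library check, this matches one codepoint at a time, so skin-tone modifiers and joined sequences come back as separate items.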
Solution 4
import emojis
new_list = emojis.get('🤔 🙈 me así, bla es se 😌 ds 💕👭👙')
print(new_list)
output: {'🙈', '😌', '💕', '👭', '🤔', '👙'}
Solution 5
The top rated answer does not always work. For example flag emojis will not be found. Consider the string:
s = u'Hello \U0001f1f7\U0001f1fa hello'
What would work better is
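import re
import emoji
# Strip whitespace from each key, then try longer sequences first so that
# multi-codepoint emojis (such as flags) are not split into their parts.
emojis_list = sorted((''.join(x.split()) for x in emoji.UNICODE_EMOJI['en']), key=len, reverse=True)
r = re.compile('|'.join(re.escape(p) for p in emojis_list))
print(' '.join(r.findall(s)))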
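The key detail in a pattern built this way is the order of the alternatives, because re tries them from left to right and stops at the first one that matches. The sketch below uses a small hypothetical table (`KNOWN_EMOJIS`) as a stand-in for the library's much larger one, so it runs with the standard library alone.

```python
import re

# Hypothetical stand-in for the library's emoji table; a real table is much larger.
KNOWN_EMOJIS = [
    "\U0001F468",                                  # man
    "\U0001F469",                                  # woman
    "\U0001F466",                                  # boy
    "\U0001F468\u200d\U0001F469\u200d\U0001F466",  # family: man, woman, boy
    "\U0001F1F7\U0001F1FA",                        # flag: regional indicators R + U
]


def build_pattern(emojis):
    # Longest first, so a multi-codepoint sequence wins over its own prefix.
    ordered = sorted(emojis, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))


pattern = build_pattern(KNOWN_EMOJIS)
s = "Hello \U0001F1F7\U0001F1FA hello \U0001F468\u200d\U0001F469\u200d\U0001F466"
print(pattern.findall(s))
```

Without the sort, the single "man" entry would match first and the family sequence would be broken into its members.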
tumbleweed
Updated on July 05, 2022
Comments
-
tumbleweed almost 2 years
Consider the following list:
a_list = ['🤔 🙈 me así, bla es se 😌 ds 💕👭👙']
How can I extract in a new list all the emojis inside a_list?
new_lis = ['🤔 🙈 😌 💕 👭 👙']
I tried to use regex, but I do not have all the possible emoji encodings.
-
user2357112 about 7 years
That's only one particular range of emoji. There are a lot more.
-
user2357112 about 7 years
That works for this particular input, but there are plenty of other non-emoji characters that don't fall under the categories of \w, \s, or comma.
-
Mazdak about 7 years
@user2357112 Text generally contains word characters and punctuation, which this approach handles easily; for other cases you can add the characters to the character class manually. Note that since a character class can contain ranges of characters, you can make it shorter and more flexible.
-
user2357112 about 7 years
Your regex fails on all non-comma punctuation, among other things.
-
Mazdak about 7 years
@user2357112 Well, that's what I said. You can add them to the character class if you want. You don't always have to cover every case; it's relative and depends on the text you're dealing with.
-
user2357112 about 7 years
Manually adding every non-emoji character from your text to your regex is a terrible, bloaty, error-prone solution.
-
Mazdak about 7 years
@user2357112 Maybe, in the case where your text contains all of those characters. Nevertheless, for the sake of completeness I updated the answer with another way, which uses a character class of emoji ranges instead of excluding non-emojis.
-
shanraisshan about 7 years
You can download the list of emoji in string/int format present in #EmojiCodeSheet here, for a custom comparator.
-
Nomiluks about 6 years
Your code cannot detect flags in the text: extract_emojis("🇵🇰 👧 🏿")
-
Nomiluks about 6 years
Your code is working well, but how can we handle flags? "🇵🇰"
-
Pedro Castilho about 6 years
@NomanDilawar That is because my code iterates over every character. Unicode flags are a combination of two "regional indicator" characters which are not, individually, emoji. If you want to detect Unicode flags you'll need to check pairs of characters.
-
sheldonzy about 6 years
@NomanDilawar Hi, sorry for the delay. I edited my answer. I ran some tests and it seems to work fine now.
-
kingmakerking almost 6 years
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if any(char in emoji.UNICODE_EMOJI for char in word):
is what I am getting.
-
Paulo Malvar about 5 years
This is the only solution that I found to work comprehensively for all emojis I've encountered so far.
-
Amir Shabani over 4 years
You can replace print(' '.join(emoji for emoji in counter)) with print(' '.join(counter)). It does the same thing.
-
Amir Shabani over 4 years
Also, I think it's better to write for grapheme in data: instead of for word in data:, as it better reflects the purpose of \X.
-
Alex about 3 years
As of emoji v1.2.0, the check must include a language specifier, e.g. any(char in emoji.UNICODE_EMOJI["en"] for char in grapheme)
-
msarafzadeh about 3 years@Nomiluks I had to filter it either per language or do a recursive dictionary search. 'πΆ' in emoji.UNICODE_EMOJI['en']
-
Samuelf80 about 3 years
Thank you! Out of all the responses on the page, this worked the best.
-
Jesse Aldridge about 3 years
Doesn't work in Python 3.6? I get an empty string.
-
Matteo almost 3 years
The answer has been updated to include ['en']. It should work again now.
-
Henry Munro over 2 years
I just needed to do a quick search of a code base and the following got what I needed: [^\w\s,;"{}='!*:\./[[]\-()\$#<>&@|\^`\+\?\\~%ββ£β΅,β¬] Not scalable, but just in case anyone else finds it useful.