How to extract all the emojis from text?

python python-3.x emoji

60,639

Solution 1

You can use the emoji library. To test whether a single codepoint is an emoji codepoint, check whether it is contained in emoji.UNICODE_EMOJI (emoji.UNICODE_EMOJI['en'] as of emoji v1.2.0, as in the code below).

import emoji

def extract_emojis(s):
  return ''.join(c for c in s if c in emoji.UNICODE_EMOJI['en'])

Solution 2

I think it's important to point out that the previous answers won't work with emojis like 👨‍👩‍👦‍👦, because it consists of 4 emojis joined by zero-width joiners, so checking each codepoint with ... in emoji.UNICODE_EMOJI returns 4 separate emojis. The same goes for emojis with skin color, like 🙅🏽, which are a base emoji followed by a modifier codepoint.
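To make the point above concrete, here is a minimal sketch that lists the codepoints behind those two emojis using only the standard library. The helper name describe_codepoints is hypothetical, and the strings are written as escapes so that the codepoints are unambiguous.

```python
import unicodedata


def describe_codepoints(s):
    """Return a (codepoint, Unicode name) pair for every codepoint in s."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in s]


# Family emoji: man, woman, boy, boy, joined by ZERO WIDTH JOINER (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466"
# Gesture emoji followed by a skin tone modifier.
no_gesture = "\U0001F645\U0001F3FD"

for label, text in (("family", family), ("no_gesture", no_gesture)):
    print(label, "has", len(text), "codepoints")
    for codepoint, name in describe_codepoints(text):
        print("   ", codepoint, name)
```

Iterating over the string character by character, as in Solution 1, sees each of these codepoints separately, which is why the sequence falls apart.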

My solution

Include the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character), so we can count emojis like 👨‍👩‍👦‍👦

import emoji
import regex

def split_count(text):

    emoji_list = []
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI['en'] for char in word):
            emoji_list.append(word)
    
    return emoji_list

Testing

With more emojis, including some with skin color:

line = ["🤔 🙈 me así, se 😌 ds 💕👭👙 hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦 how are 😊 you today🙅🏽🙅🏽"]

counter = split_count(line[0])
print(' '.join(emoji for emoji in counter))

output:

🤔 🙈 😌 💕 👭 👙 👩🏾‍🎓 👨‍👩‍👦‍👦 😊 🙅🏽 🙅🏽

Include flags

If you want to include flags, like 🇵🇰, the Unicode range of the regional indicator symbols runs from 🇦 to 🇿, so add:

flags = regex.findall(u'[\U0001F1E6-\U0001F1FF]', text)

to the function above, and return emoji_list + flags.

See this answer to "A python regex that matches the regional indicator character class" for more information about the flags.
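Note that the character class above matches each regional indicator symbol on its own, while a flag is a pair of them. A minimal sketch of matching whole flags with the standard re module follows; the names FLAG_PATTERN and find_flags are hypothetical, and the pattern does not check that a pair is a valid country code.

```python
import re

# Two consecutive regional indicator symbols (U+1F1E6..U+1F1FF) form one flag.
# Assumption: any such pair counts as a flag; validity is not checked.
FLAG_PATTERN = re.compile(r"[\U0001F1E6-\U0001F1FF]{2}")


def find_flags(text):
    """Return every pair of adjacent regional indicator symbols in text."""
    return FLAG_PATTERN.findall(text)


sample = "Hello \U0001F1F5\U0001F1F0 and \U0001F1F7\U0001F1FA"
flags = find_flags(sample)
print(flags)
print([len(flag) for flag in flags])  # each match holds two codepoints
```

Because the pattern consumes symbols two at a time from left to right, several flags written without any separator are still split into pairs in order.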

For newer `emoji` versions

To work with emoji >= v1.2.0, you have to add a language specifier (e.g. en, as in the code above):

emoji.UNICODE_EMOJI['en']

Note that emoji 2.0.0 removed UNICODE_EMOJI altogether; on those versions, use emoji.EMOJI_DATA or emoji.is_emoji() for the membership check instead.

Solution 3

If you don't want to use an external library, you can use regular expressions: re.findall() with a suitable pattern finds the emojis:

In [74]: import re
In [75]: re.findall(r'[^\w\s,]', a_list[0])
Out[75]: ['🤔', '🙈', '😌', '💕', '👭', '👙']

The regular expression r'[^\w\s,]' is a negated character class that matches any character that is not a word character, whitespace or comma.

As I mentioned in a comment, text generally contains word characters and punctuation, which this approach can deal with easily; for other cases, you can add the extra characters to the character class manually. Since a character class can contain ranges of characters, you can also make it shorter and more flexible.
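To see why extra characters may have to be added to the class, here is a small sketch of what the negated class returns on text with other punctuation. The function name find_non_word_chars and the sample sentence are made up for illustration.

```python
import re

# Same negated class as above: anything that is not a word character,
# whitespace, or a comma.
NOT_WORD_SPACE_COMMA = re.compile(r"[^\w\s,]")


def find_non_word_chars(text):
    """Return every character the negated class accepts, emoji or not."""
    return NOT_WORD_SPACE_COMMA.findall(text)


# Hypothetical input with punctuation other than commas.
print(find_non_word_chars("Wait... really?! \U0001F60C"))
```

The dots, the question mark, and the exclamation mark come back alongside the emoji, so each of them would have to be excluded explicitly.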

Another solution is to use, instead of a negated character class that excludes the non-emoji characters, a character class that accepts emojis ([] without ^). Since emojis are spread over many different Unicode values, you just need to add their ranges to the character class. If you want to match more emojis, this reference lists the standard emojis with their respective ranges: http://apps.timwhitlock.info/emoji/tables/unicode
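A minimal sketch of the positive character class approach, using the standard re module. The ranges below are an illustrative subset of Unicode blocks chosen for this example, not a complete definition of emoji; some of these blocks also contain symbols that are not emoji, and the names EMOJI_RANGES and find_emojis_by_range are hypothetical.

```python
import re

# Assumption: an illustrative subset of blocks, not an exhaustive emoji list.
EMOJI_RANGES = (
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\u2600-\u26FF"          # Miscellaneous Symbols
    "\u2700-\u27BF"          # Dingbats
)
EMOJI_PATTERN = re.compile(f"[{EMOJI_RANGES}]")


def find_emojis_by_range(text):
    """Return every single codepoint in text that falls in the ranges above."""
    return EMOJI_PATTERN.findall(text)


a_list = ['🤔 🙈 me así, bla es se 😌 ds 💕👭👙']
print(find_emojis_by_range(a_list[0]))
```

Like the negated class, this matches one codepoint at a time, so flags and joined sequences still need separate handling.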

Solution 4

import emojis
new_list = emojis.get('🤔 🙈 me así, bla es se 😌 ds 💕👭👙')
print(new_list)

output:

{'😌', '🙈', '👭', '💕', '🤔', '👙'}

Solution 5

The top-rated answer does not always work. For example, flag emojis will not be found. Consider the string:

s = u'Hello \U0001f1f7\U0001f1fa hello'

What would work better is

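import re
import emoji

# Strip whitespace from each emoji key and try longer sequences first, so that
# flags and other multi-codepoint emojis are not split into shorter matches.
emojis_list = sorted((''.join(x.split()) for x in emoji.UNICODE_EMOJI['en']), key=len, reverse=True)
r = re.compile('|'.join(re.escape(p) for p in emojis_list))
print(' '.join(r.findall(s)))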


Author by

tumbleweed

Updated on July 05, 2022

Comments

tumbleweed almost 2 years
Consider the following list:
```
a_list = ['🤔 🙈 me así, bla es se 😌 ds 💕👭👙']
```
How can I extract all the emojis inside a_list into a new list?
```
new_lis = ['🤔 🙈 😌 💕 👭 👙']
```
I tried to use regex, but I do not have all the possible emojis encodings.
user2357112 about 7 years

That's only one particular range of emoji. There are a lot more.
user2357112 about 7 years

That works for this particular input, but there are plenty of other non-emoji characters that don't fall under the categories of \w, \s, or comma.
Mazdak about 7 years

@user2357112 Text generally contains word characters and punctuation, which this approach can deal with easily; for other cases, you can add the extra characters to the character class manually. Since a character class can contain ranges of characters, you can also make it shorter and more flexible.
user2357112 about 7 years

Your regex fails on all non-comma punctuation, among other things.
Mazdak about 7 years

@user2357112 Well, that's what I said. You can add them to the character class if you want. You don't always have to include all the cases; it's relative and depends on the text that you're dealing with.
user2357112 about 7 years

Manually adding every non-emoji character from your text to your regex is a terrible, bloaty, error-prone solution.
Mazdak about 7 years

@user2357112 Maybe, if your text contains all of those characters. Nevertheless, for the sake of completeness I updated the answer with another way, which uses a character class with the ranges of the emojis instead of excluding non-emojis.
shanraisshan about 7 years

You can download the list of emoji in string/int format present in #EmojiCodeSheet here, for custom comparator.
Nomiluks about 6 years

your code cannot detect flags in the text : extract_emojis("🇵🇰 👧 🏿")
Nomiluks about 6 years

Your code is working good, but how can we handle flags? "🇵🇰 "
Pedro Castilho about 6 years

@NomanDilawar that is because my code iterates over every character. Unicode flags are a combination of two "regional indicator" characters which are not, individually, emoji. If you want to detect Unicode flags you'll need to check pairs of characters.
sheldonzy about 6 years

@NomanDilawar Hi, sorry for the delay. I edited my answer. I ran some tests and it seems to work fine now.
kingmakerking almost 6 years

I am getting "UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal" at the line if any(char in emoji.UNICODE_EMOJI for char in word):
Paulo Malvar about 5 years

This is the only solution that I found to work comprehensively for all emojis I've encountered so far.
Amir Shabani over 4 years

You can replace print(' '.join(emoji for emoji in counter)) with print(' '.join(counter)). Does the same thing.
Amir Shabani over 4 years

Also, I think it's better to write for grapheme in data: instead of for word in data: as it reflects the purpose of \X better.
Alex about 3 years

As of emoji v.1.2.0, the check must include a language specifier, e.g. any(char in emoji.UNICODE_EMOJI["en"] for char in grapheme)
msarafzadeh about 3 years

@Nomiluks I had to filter it either per language or do a recursive dictionary search. '👶' in emoji.UNICODE_EMOJI['en']
Samuelf80 about 3 years

Thank you! Out of all the responses on the page, this worked the best
Jesse Aldridge about 3 years

Doesn't work in Python 3.6? I get an empty string.
Matteo almost 3 years

The answer has been updated to include ['en']. It should work again now.
Henry Munro over 2 years

I just needed to do a quick search of a code base and the following got what I needed: [^\w\s,;"{}='!*:\./[[]\-()\$#<>&@|\^`\+\?\\~‌%‘’£₵,€] Not scalable, but just in case anyone else finds it useful.