How to extract all the emojis from text?

60,639

Solution 1

You can use the emoji library. To check whether a single codepoint is an emoji codepoint, test whether it is contained in emoji.UNICODE_EMOJI['en'] (emoji >= 1.2.0; earlier versions use emoji.UNICODE_EMOJI directly).

import emoji

def extract_emojis(s):
  return ''.join(c for c in s if c in emoji.UNICODE_EMOJI['en'])

Solution 2

I think it's important to point out that the previous answers won't work with emojis like 👨‍👩‍👦‍👦, because it consists of 4 emojis, and using ... in emoji.UNICODE_EMOJI will return 4 different emojis. The same goes for emojis with a skin tone, like 🙅🏽.

My solution

Include the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character), so we can count emojis like 👨‍👩‍👦‍👦 as one.

import emoji
import regex

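def split_count(text):
    emoji_list = []
    # \X matches one extended grapheme cluster, so ZWJ and skin-tone sequences stay whole
    data = regex.findall(r'\X', text)
    for word in data:
        # keep the cluster if any of its codepoints is a known emoji
        if any(char in emoji.UNICODE_EMOJI['en'] for char in word):
            emoji_list.append(word)

    return emoji_list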

Testing

With more emojis, including ones with skin tones:

line = ["🤔 🙈 me así, se 😌 ds 💕👭👙 hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦 how are 😊 you today🙅🏽🙅🏽"]

counter = split_count(line[0])
print(' '.join(emoji for emoji in counter))

output:

🤔 🙈 😌 💕 👭 👙 👩🏾‍🎓 👨‍👩‍👦‍👦 😊 🙅🏽 🙅🏽
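
For readers who cannot install regex, the grouping that \X performs can be roughly approximated with the standard re module. The sketch below is not a full grapheme-cluster implementation: the names BASE, MODIFIER, UNIT, CLUSTER and find_emoji_clusters are made up for illustration, and the code point ranges are an assumption that covers only some emoji blocks. It shows the idea of gluing together a base emoji, an optional skin-tone modifier, and further emojis attached by a zero-width joiner.

```python
import re

# Assumption: these blocks cover only part of the emoji repertoire.
BASE = ('[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F'
        '\U0001F680-\U0001F6FF\U0001F900-\U0001F9FF\u2600-\u27BF]')
MODIFIER = '[\U0001F3FB-\U0001F3FF]'  # skin-tone modifiers
ZWJ = '\u200D'                         # zero-width joiner
VS16 = '\uFE0F'                        # emoji presentation selector

# One emoji, optionally followed by a skin tone and a presentation selector
UNIT = f'{BASE}{MODIFIER}?{VS16}?'
# One or more units glued together by zero-width joiners
CLUSTER = re.compile(f'{UNIT}(?:{ZWJ}{UNIT})*')


def find_emoji_clusters(text):
    """Return emoji sequences, keeping ZWJ and skin-tone sequences whole."""
    return CLUSTER.findall(text)


# woman + skin tone + ZWJ + graduation cap, a four-person family, two gestures with skin tone
sample = ('\U0001F469\U0001F3FE\u200D\U0001F393 hello '
          '\U0001F468\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466 '
          'today\U0001F645\U0001F3FD\U0001F645\U0001F3FD')
print(len(find_emoji_clusters(sample)))
```

The regex-based function above remains the more reliable choice, since \X follows the Unicode segmentation rules rather than a hand-written list of ranges.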

Include flags

If you want to include flags, like 🇵🇰, the Unicode range runs from 🇦 to 🇿, so add:

flags = regex.findall(u'[\U0001F1E6-\U0001F1FF]', text) 

to the function above, and return emoji_list + flags.

See this answer to "A python regex that matches the regional indicator character class" for more information about the flags.
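
As noted in the comments, a flag is a pair of regional indicator characters, so matching the indicators one at a time yields two items per flag. A minimal sketch of matching them in pairs with the standard re module follows; FLAG and find_flags are illustrative names, not part of any library.

```python
import re

# Two consecutive regional indicator symbols (U+1F1E6..U+1F1FF) form one flag.
FLAG = re.compile('[\U0001F1E6-\U0001F1FF]{2}')


def find_flags(text):
    """Return each flag as one two-codepoint string."""
    return FLAG.findall(text)


# U+1F1F7 U+1F1FA are the regional indicators R and U
print(find_flags('Hello \U0001F1F7\U0001F1FA hello'))
```

This pairs indicators from left to right, so a run with an odd number of indicators leaves the last one unmatched.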

For newer emoji versions

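To work with emoji >= v1.2.0, you have to add a language specifier (e.g. en, as in the code above):

emoji.UNICODE_EMOJI['en']

Note that emoji 2.0 removed UNICODE_EMOJI altogether; on those versions, emoji.EMOJI_DATA and emoji.is_emoji() serve the same purpose.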

Solution 3

If you don't want to use an external library, you can use regular expressions: re.findall() with a suitable pattern finds the emojis:

In [74]: import re
In [75]: re.findall(r'[^\w\s,]', a_list[0])
Out[75]: ['🤔', '🙈', '😌', '💕', '👭', '👙']

The regular expression r'[^\w\s,]' is a negated character class that matches any character that is not a word character, whitespace or comma.
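
A quick check of what this pattern actually returns may help. The sketch below uses a made-up input string and a hypothetical helper name, non_word_chars; it shows that punctuation other than a comma is matched along with the emoji, which is the limitation raised in the comments.

```python
import re

# Same pattern as above: anything that is not a word character, whitespace or comma
pattern = re.compile(r'[^\w\s,]')


def non_word_chars(text):
    """Return every character the negated class lets through."""
    return pattern.findall(text)


# U+1F60C is the "relieved face" emoji; '!' and '.' are returned as well
print(non_word_chars('Wow! \U0001F60C, ok.'))
```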

As I mentioned in a comment, text generally contains word characters and punctuation, which this approach handles easily; for other cases you can add the characters to the character class manually. Note that since you can specify a range of characters in a character class, you can make it even shorter and more flexible.

Another solution is to replace the negated character class, which excludes the non-emoji characters, with a character class that accepts emojis ([] without ^). Since there are a lot of emojis with different Unicode values, you just need to add their ranges to the character class. If you want to match more emojis, here is a good reference that lists the standard emojis with their respective ranges: http://apps.timwhitlock.info/emoji/tables/unicode
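
A minimal sketch of such a positive character class, using only the standard re module. The ranges below are a partial selection of Unicode blocks chosen for illustration, and EMOJI_CLASS and extract_by_range are made-up names; extend the ranges to suit your text.

```python
import re

# Partial selection of blocks (an assumption, not the complete emoji set):
#   U+1F300..U+1F5FF  Miscellaneous Symbols and Pictographs
#   U+1F600..U+1F64F  Emoticons
#   U+1F680..U+1F6FF  Transport and Map Symbols
#   U+1F900..U+1F9FF  Supplemental Symbols and Pictographs
#   U+2600..U+27BF    Miscellaneous Symbols and Dingbats
EMOJI_CLASS = re.compile(
    '[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F'
    '\U0001F680-\U0001F6FF\U0001F900-\U0001F9FF\u2600-\u27BF]'
)


def extract_by_range(text):
    """Return every single codepoint that falls in one of the ranges."""
    return EMOJI_CLASS.findall(text)


# The string from the question, written with escapes
a_list = ['\U0001F914 \U0001F648 me as\u00ed, bla es se \U0001F60C ds '
          '\U0001F495\U0001F46D\U0001F459']
print(' '.join(extract_by_range(a_list[0])))
```

Unlike the negated class, this ignores punctuation, but it still works one codepoint at a time, so flags and joined sequences come back in pieces.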

Solution 4

import emojis
new_list = emojis.get('🤔 🙈 me así, bla es se 😌 ds 💕👭👙')
print(new_list)

output>>>{'😌', '🙈', '👭', '💕', '🤔', '👙'}

Solution 5

The top-rated answer does not always work. For example, flag emojis will not be found. Consider the string:

s = u'Hello \U0001f1f7\U0001f1fa hello'

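What would work better is:

import re
import emoji

# Longest sequences first, so multi-codepoint emojis (e.g. flags) win over their parts
emojis_list = sorted((''.join(x.split()) for x in emoji.UNICODE_EMOJI['en']), key=len, reverse=True)
r = re.compile('|'.join(re.escape(p) for p in emojis_list))
print(' '.join(r.findall(s)))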
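
The same alternation idea can be tried without the emoji package. In the sketch below, KNOWN is a hypothetical hand-picked list standing in for the library's full emoji table, and build_matcher is an illustrative helper; the point is that longer sequences must come first in the alternation, otherwise a shorter emoji would match part of a longer one.

```python
import re

# Hypothetical stand-in for the library's emoji table:
# woman, woman with skin tone, a two-indicator flag, smiling face
KNOWN = ['\U0001F469', '\U0001F469\U0001F3FE', '\U0001F1F7\U0001F1FA', '\U0001F60A']


def build_matcher(emojis):
    """Compile an alternation that tries the longest emoji sequences first."""
    ordered = sorted(emojis, key=len, reverse=True)
    return re.compile('|'.join(re.escape(e) for e in ordered))


matcher = build_matcher(KNOWN)
s = 'Hello \U0001F1F7\U0001F1FA hello \U0001F469\U0001F3FE \U0001F60A'
print(' '.join(matcher.findall(s)))
```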
Author by

tumbleweed

Updated on July 05, 2022

Comments

  • tumbleweed
    tumbleweed almost 2 years

    Consider the following list:

    a_list = ['🤔 🙈 me así, bla es se 😌 ds 💕👭👙']
    

    How can I extract all the emojis inside a_list into a new list?

    new_lis = ['🤔 🙈 😌 💕 👭 👙']
    

    I tried to use regex, but I do not have all the possible emoji encodings.

  • user2357112
    user2357112 about 7 years
    That's only one particular range of emoji. There are a lot more.
  • user2357112
    user2357112 about 7 years
    That works for this particular input, but there are plenty of other non-emoji characters that don't fall under the categories of \w, \s, or comma.
  • Mazdak
    Mazdak about 7 years
    @user2357112 Text generally contains word characters and punctuation, which this approach handles easily; for other cases you can add the characters to the character class manually. Note that since you can specify a range of characters in a character class, you can make it even shorter and more flexible.
  • user2357112
    user2357112 about 7 years
    Your regex fails on all non-comma punctuation, among other things.
  • Mazdak
    Mazdak about 7 years
    @user2357112 Well, that's what I said. You can add them to the character class if you want. You don't always have to include all the cases; it's relative and depends on the text you're dealing with.
  • user2357112
    user2357112 about 7 years
    Manually adding every non-emoji character from your text to your regex is a terrible, bloaty, error-prone solution.
  • Mazdak
    Mazdak about 7 years
    @user2357112 Maybe, if your text contains all of those characters. Nevertheless, for the sake of completeness I updated the answer with another way, which uses a character class with emoji ranges instead of excluding non-emojis.
  • shanraisshan
    shanraisshan about 7 years
    You can download the list of emoji in string/int format present in #EmojiCodeSheet here, for custom comparator.
  • Nomiluks
    Nomiluks about 6 years
    Your code cannot detect flags in the text: extract_emojis("🇵🇰 👧 🏿")
  • Nomiluks
    Nomiluks about 6 years
    Your code works well, but how can we handle flags? "🇵🇰 "
  • Pedro Castilho
    Pedro Castilho about 6 years
    @NomanDilawar that is because my code iterates over every character. Unicode flags are a combination of two "regional indicator" characters which are not, individually, emoji. If you want to detect Unicode flags you'll need to check pairs of characters.
  • sheldonzy
    sheldonzy about 6 years
    @NomanDilawar Hi, sorry for the delay. I edited my answer. I ran some tests and it seems to work fine now.
  • kingmakerking
    kingmakerking almost 6 years
    UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal if any(char in emoji.UNICODE_EMOJI for char in word): is what I am getting.
  • Paulo Malvar
    Paulo Malvar about 5 years
    This is the only solution that I found to work comprehensively for all emojis I've encountered so far.
  • Amir Shabani
    Amir Shabani over 4 years
    You can replace print(' '.join(emoji for emoji in counter)) with print(' '.join(counter)). Does the same thing.
  • Amir Shabani
    Amir Shabani over 4 years
    Also, I think it's better to write for grapheme in data: instead of for word in data: as it reflects the purpose of \X better.
  • Alex
    Alex about 3 years
    As of emoji v.1.2.0, the check must include a language specifier, e.g. any(char in emoji.UNICODE_EMOJI["en"] for char in grapheme)
  • msarafzadeh
    msarafzadeh about 3 years
    @Nomiluks I had to filter it either per language or do a recursive dictionary search. '👶' in emoji.UNICODE_EMOJI['en']
  • Samuelf80
    Samuelf80 about 3 years
    Thank you! Out of all the responses on the page, this worked the best
  • Jesse Aldridge
    Jesse Aldridge about 3 years
    Doesn't work in Python 3.6? I get an empty string.
  • Matteo
    Matteo almost 3 years
    The answer has been updated to include ['en']. It should work again now.
  • Henry Munro
    Henry Munro over 2 years
    I just needed to do a quick search of a code base and the following got what I needed: [^\w\s,;"{}='!*:\./[[]\-()\$#<>&@|\^`\+\?\\~%‘’£₵,€] Not scalable, but just in case anyone else finds it useful.