Match unicode emoji in python regex

14,264

Solution 1

Since there are many emoji with different Unicode values, you have to specify them explicitly in your regex, or, if they fall within a specific range, you can use a character class. In this case your second symbol is not a standard emoji, it's just a Unicode character, but since its code point is greater than \u263a (the Unicode representation of ☺️) you can put it in a range with \u263a:

In [71]: s = 'blah xzuyguhbc ibcbb bqw 2 extract1  ☺️ jbjhcb 6 extract2 🙅 bjvcvvv'

In [72]: regex = re.compile(r'\d+(.*?)(?:\u263a|\U0001f645)')

In [74]: regex.findall(s)
Out[74]: [' extract1  ', ' extract2 ']
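One detail worth checking is which code points the string really contains. The ☺️ in this example is U+263A followed by the variation selector U+FE0F, so the pattern above leaves U+FE0F behind in the unmatched text. The following is a minimal sketch, assuming Python 3; the trailing `\ufe0f?` is an addition for illustration and is not part of the original answer.

```python
import re

s = 'bqw 2 extract1  \u263a\ufe0f jbjhcb 6 extract2 \U0001f645 bjvcvvv'

# List the non-ASCII code points to see what actually needs matching.
code_points = [hex(ord(ch)) for ch in s if ord(ch) > 127]
print(code_points)

# Same alternation as above, plus an optional variation selector
# (an assumption added here) so U+FE0F is consumed with the emoji.
regex = re.compile(r'\d+(.*?)(?:\u263a|\U0001f645)\ufe0f?')
print(regex.findall(s))
```

The captured groups are unchanged; only the overall match now includes the selector when it is present.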

Or, if you want to match more emoji, you can use a character range (this reference shows the code point ranges for different emoji: http://apps.timwhitlock.info/emoji/tables/unicode):

In [75]: regex = re.compile(r'\d+(.*?)[\u263a-\U0001f645]')

In [76]: regex.findall(s)
Out[76]: [' extract1  ', ' extract2 ']

Note that in the second case you have to make sure that all the characters within the aforementioned range are emoji that you want to match.
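To make that caveat concrete: the range \u263a-\U0001f645 also covers CJK ideographs and many other non-emoji characters. The sketch below, assuming Python 3, compares it with a narrower class. The two blocks in `narrow` are an illustrative assumption, not a complete or exact list of emoji, and they still contain some non-emoji symbols.

```python
import re

# The broad range from the answer: U+263A through U+1F645.
broad = re.compile(r'\d+(.*?)[\u263a-\U0001f645]')

# Assumed, non-exhaustive alternative (for illustration only):
#   U+2600-U+27BF    Miscellaneous Symbols and Dingbats
#   U+1F300-U+1FAFF  pictographs, emoticons, transport and later additions
narrow = re.compile(r'\d+(.*?)[\u2600-\u27bf\U0001f300-\U0001faff]')

# '\u4e2d\u6587' is the Chinese word for "Chinese"; neither character is an emoji.
s = 'id 3 \u4e2d\u6587 \u263a\ufe0f'

print(broad.findall(s))   # stops at the first CJK character
print(narrow.findall(s))  # runs on until the actual emoji
```

If exact coverage matters, the ranges should be checked against an up-to-date emoji table rather than taken from this sketch.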

Here is another example:

In [77]: s = "blah 4 xzuyguhbc 😺 ibcbb bqw 2 extract1  ☺️ jbjhcb 6 extract2 🙅 bjvcvvv"

In [78]: regex = re.compile(r'\d+(.*?)[\u263a-\U0001f645]')

In [79]: regex.findall(s)
Out[79]: [' xzuyguhbc ', ' extract1  ', ' extract2 ']
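The captured groups keep their surrounding spaces, while the desired output in the question has none. A small sketch, assuming Python 3, that reuses the pattern above and strips each result:

```python
import re

s = ('blah 4 xzuyguhbc \U0001f63a ibcbb bqw 2 extract1  \u263a\ufe0f '
     'jbjhcb 6 extract2 \U0001f645 bjvcvvv')

regex = re.compile(r'\d+(.*?)[\u263a-\U0001f645]')

# Strip leading and trailing whitespace from every captured group.
cleaned = [m.strip() for m in regex.findall(s)]
print(cleaned)
```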

Solution 2

Here's my stab at a solution; I'm not sure it will work in all circumstances. The trick is to convert all Unicode emoji into plain escaped text, which can be done by following this post. You can then match the emoji just like any other text. Note that it won't work if the literal strings \u or \U appear in the text you are searching.

Example: Copy your string into a file, let's call it emo. In terminal:

Chip chip@ 03:24:33@ ~: cat emo | python stackoverflow.py
blah xzuyguhbc ibcbb bqw 2 extract1  \u263a\ufe0f jbjhcb 6 extract2 \U0001f645 bjvcvvv\n
------------------------
[' extract1  ', ' extract2 ']

Where stackoverflow.py file is:

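import fileinput
import re

# Python 3 version: escape each input line, then search the escaped text.
for line in fileinput.input():
    # Non-ASCII characters become escapes, e.g. U+263A -> \u263a
    teststring = line.encode('unicode-escape').decode('ascii')
    print(teststring)
    print("------------------------")
    # Text after "whitespace + digit", up to the next \u or \U escape
    m = re.findall(r'(?<=\s\d)(.*?)(?=\\[uU])', teststring)
    print(m)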
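For a string that is already in memory, the same idea can be sketched as a function. This assumes Python 3; the name `extract_via_escape` is hypothetical. It also demonstrates the caveat mentioned above: a literal backslash followed by u in the input is indistinguishable from an escaped emoji.

```python
import re

def extract_via_escape(text):
    """Return text found between 'whitespace + digit' and the next escape."""
    # Non-ASCII characters become \xNN, \uXXXX or \UXXXXXXXX escapes;
    # literal backslashes in the input are doubled.
    escaped = text.encode('unicode-escape').decode('ascii')
    # Lazily capture up to the next backslash followed by u or U.
    return re.findall(r'(?<=\s\d)(.*?)(?=\\[uU])', escaped)

s = 'blah bqw 2 extract1  \u263a\ufe0f jbjhcb 6 extract2 \U0001f645 bjvcvvv'
print(extract_via_escape(s))

# Caveat: a literal "\u" in the input triggers a false match before the emoji.
print(extract_via_escape(' 1 a\\u b \u263a'))
```

The second call stops at the literal backslash rather than at the emoji, which is the failure mode the answer warns about.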
Author: LeDerp

Abhishek Kalyan. :p

Updated on June 09, 2022

Comments

  • LeDerp almost 2 years

    I need to extract the text between a number and an emoticon in a string.

    example text:

    blah xzuyguhbc ibcbb bqw 2 extract1  ☺️ jbjhcb 6 extract2 🙅 bjvcvvv
    

    output:

    extract1
    extract2
    

    The regex I wrote extracts the text between two numbers. I need to change it so that it identifies the Unicode emoji characters and extracts the text between a number and an emoji.

    (?<=[\s][\d])(.*?)(?=[\d])
    

    Please suggest a Python-friendly method. I need it to work with all emoji, not only the ones given in the example.

    https://regex101.com/r/uT1fM0/1