Python - Reading Emoji Unicode Characters

Solution 1

I don't think you're using encode correctly, nor do you need to. What you have is a valid unicode string with one 4-digit (`\u`) and one 8-digit (`\U`) escape sequence. Try this in the REPL on, say, OS X:

>>> s = u'that\u2019s \U0001f63b'
>>> print s
that’s 😻

In Python 3, though:

Python 3.4.3 (default, Jul  7 2015, 15:40:07) 
>>> s  = u'that\u2019s \U0001f63b'
>>> s[-1]
'😻'
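
As a rough illustration of the Python 3 behaviour shown above, the sketch below lists the code points of the same string. It assumes Python 3.3 or later, where each element of a `str` is one full code point; the variable name `code_points` is only illustrative.

```python
# Assumes Python 3.3+; on a narrow Python 2 build the results differ.
s = u'that\u2019s \U0001f63b'

# Each element of a Python 3 str is one code point, so the emoji is one item.
code_points = [hex(ord(c)) for c in s]

print(len(s))           # 8
print(code_points[-1])  # 0x1f63b
```

The last value, 0x1f63b, is the emoji's real code point, which is why the escape needs eight hex digits (`\U0001f63b`) rather than four.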

Solution 2

Your remaining confusion is likely because you are running what is called a "narrow Python build". On such a build each element of a unicode string is a 16-bit code unit, so a single element cannot hold a code point above U+FFFF, which covers most emoji. Those characters are stored as two elements instead, a UTF-16 surrogate pair. The best solution would be to move to Python 3. Otherwise, try to process the UTF-16 surrogate pair.
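
If moving to Python 3 is not an option, processing the surrogate pair could look something like the sketch below. The helper names `combine_surrogates` and `iter_code_points` are hypothetical, not part of any library; the arithmetic is the standard UTF-16 decoding rule. Treat it as a minimal sketch rather than a complete solution.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def iter_code_points(text):
    """Yield code points, merging surrogate pairs where they occur."""
    i = 0
    while i < len(text):
        unit = ord(text[i])
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(text):
            nxt = ord(text[i + 1])
            if 0xDC00 <= nxt <= 0xDFFF:
                yield combine_surrogates(unit, nxt)
                i += 2
                continue
        # Ordinary character (or an unpaired surrogate): pass it through.
        yield unit
        i += 1


# The two halves that a narrow build reports for the emoji.
narrow = u'that\u2019s \ud83d\ude3b'
code_points = list(iter_code_points(narrow))
print(hex(code_points[-1]))  # 0x1f63b
```

Iterating with `iter_code_points` rather than indexing the string directly gives one value per character regardless of how the build stores it.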

Andrew LaPrise

Updated on July 22, 2022

Comments

  • Andrew LaPrise
    Andrew LaPrise almost 2 years

    I have a Python 2.7 program which reads iOS text messages from a SQLite database. The text messages are unicode strings. In the following text message:

    u'that\u2019s \U0001f63b'
    

    The apostrophe is represented by \u2019, but the emoji is represented by \U0001f63b. I looked up the code point for the emoji in question, and it's \uf63b. I'm not sure where the 0001 is coming from. I know comically little about character encodings.

    When I print the text, character by character, using:

    s = u'that\u2019s \U0001f63b'
    
    for c in s:
        print c.encode('unicode_escape')
    

    The program produces the following output:

    t
    h
    a
    t
    \u2019
    s
    
    \ud83d
    \ude3b
    

    How can I correctly read these last characters in Python? Am I using encode correctly here? Should I just attempt to trash those 0001s before reading it, or is there an easier, less silly way?

  • Andrew LaPrise
    Andrew LaPrise almost 9 years
    Well would ya look at that... I really know nothing about nothing. Thanks! I'm still not clear how to read just that last character though. s[-1] and s[-2] still give '\ud83d' and '\ude3b'. Is there a way to read the string character by character?
  • Mark Ransom
    Mark Ransom almost 9 years
    @alaprise you're seeing an artifact of the way Python stores its Unicode strings internally. If you did the same thing in Python 3 you'd see something different entirely.
  • pvg
    pvg almost 9 years
    @alaprise The other answer has some good info, of which the summary is 'if possible move to Python3'. Otherwise you're entering a world of pain/surrogate pairs/words you don't want to know for they are the song of Cthulhu
  • roeland
    roeland almost 9 years
    '\ud83d' and '\ude3b' are a surrogate pair, used by UTF-16 to represent a code point above U+FFFF. This is a bug in Python 2; a lot of languages have that problem with those characters.
  • jfs
    jfs almost 9 years
    @roeland: s[-1] == u'\U0001f63b' on both Python 2 and 3 on my machine ("wide Python builds" are supported since 2001)
  • jfs
    jfs almost 9 years
    regex.findall(r'\X', unicode_text) could be used to get "user-perceived characters" that may span more than one Unicode codepoint (it is unrelated to surrogate pairs but it should fix the issue as a side effect).
  • Jeef
    Jeef over 7 years
    I can't get this working with the warning sign: u'\U000026A0' - it comes out as a text glyph, not an emoji.