Python - Reading Emoji Unicode Characters

python python-2.7 unicode emoji

15,503

Solution 1

I don't think you're using encode correctly, nor do you need to. What you have is a valid unicode string with one 4-digit (\u) and one 8-digit (\U) escape sequence. Try this in the REPL on, say, OS X:

>>> s = u'that\u2019s \U0001f63b'
>>> print s
that’s 😻

In Python 3, though:

Python 3.4.3 (default, Jul  7 2015, 15:40:07) 
>>> s  = u'that\u2019s \U0001f63b'
>>> s[-1]
'😻'
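To see where the "0001" comes from, it may help to print the code point of each character. This is only a sketch, and it assumes Python 3.3+ or a wide Python 2 build, where the emoji is a single element of the string. The variable name code_points is just for illustration.

```python
# -*- coding: utf-8 -*-
# Assumption: Python 3.3+ or a wide Python 2 build (emoji stored as one element).
s = u'that\u2019s \U0001f63b'

# \u takes exactly 4 hex digits and \U takes exactly 8, so \U0001f63b
# names the single code point U+1F63B; the leading 000 is just padding.
code_points = [hex(ord(c)) for c in s]

print(code_points[-1])  # 0x1f63b under the assumption above
```

So the emoji's code point is U+1F63B, not U+F63B; \uf63b would be a different character in the private use area.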

Solution 2

Your remaining confusion is likely because you are running what is called a "narrow Python build". On a narrow build, each element of a unicode string is a 16-bit code unit, so a single element can't hold a code point above U+FFFF such as this emoji; it is stored as two code units (a UTF-16 surrogate pair). The best solution would be to move to Python 3. Otherwise, try to process the UTF-16 surrogate pair.
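If you have to stay on a narrow build, processing the surrogate pair could look roughly like the sketch below. The helper name iter_code_points is made up for this example, not a standard library function, and the string spells out the two code units from the question's output explicitly so the snippet behaves the same on any build.

```python
def iter_code_points(text):
    """Yield integer code points, merging UTF-16 surrogate pairs.

    Hypothetical helper for illustration only.
    """
    i = 0
    while i < len(text):
        unit = ord(text[i])
        # A high surrogate (D800-DBFF) followed by a low surrogate
        # (DC00-DFFF) together encode one code point above U+FFFF.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(text):
            low = ord(text[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                i += 2
                continue
        yield unit
        i += 1

# The same text as in the question, with the emoji written as its two code units.
s = u'that\u2019s \ud83d\ude3b'
points = list(iter_code_points(s))
print(hex(points[-1]))  # 0x1f63b
```

The arithmetic is the standard UTF-16 decoding: (0xD83D - 0xD800) << 10 is 0xF400, 0xDE3B - 0xDC00 is 0x23B, and adding 0x10000 gives 0x1F63B.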

Author by

Andrew LaPrise

Updated on July 22, 2022

Comments

Andrew LaPrise almost 2 years
I have a Python 2.7 program which reads iOS text messages from a SQLite database. The text messages are unicode strings. In the following text message:
```
u'that\u2019s \U0001f63b'
```
The apostrophe is represented by \u2019, but the emoji is represented by \U0001f63b. I looked up the code point for the emoji in question, and it's \uf63b. I'm not sure where the 0001 is coming from. I know comically little about character encodings.

When I print the text, character by character, using:
```
s = u'that\u2019s \U0001f63b'

for c in s:
    print c.encode('unicode_escape')
```
The program produces the following output:
```
t
h
a
t
\u2019
s

\ud83d
\ude3b
```
How can I correctly read these last characters in Python? Am I using encode correctly here? Should I just attempt to trash those 0001s before reading it, or is there an easier, less silly way?
Andrew LaPrise almost 9 years

Well would ya look at that... I really know nothing about nothing. Thanks! I'm still not clear how to read just that last character though. s[-1] and s[-2] still give '\ud83d' and '\ude3b'. Is there a way to read the string character by character?
Mark Ransom almost 9 years

@alaprise you're seeing an artifact of the way Python stores its Unicode strings internally. If you did the same thing in Python 3 you'd see something different entirely.
pvg almost 9 years

@alaprise The other answer has some good info, of which the summary is 'if possible move to Python3'. Otherwise you're entering a world of pain/surrogate pairs/words you don't want to know for they are the song of Cthulhu
roeland almost 9 years

'\ud83d' and '\ude3b' are a surrogate pair, used by UTF-16 to represent a code point above U+FFFF. This is a bug in Python 2; a lot of languages have that problem with those characters.
jfs almost 9 years

@roeland: s[-1] == u'\U0001f63b' on both Python 2 and 3 on my machine ("wide Python builds" are supported since 2001)
jfs almost 9 years

@alaprise: see How to install python on Mac with wide-build
jfs almost 9 years

regex.findall(r'\X', unicode_text) could be used to get "user-perceived characters" that may span more than one Unicode codepoint (it is unrelated to surrogate pairs but it should fix the issue as a side effect).
Jeef over 7 years

I can't get this working with the warning sign: u'\U000026A0' - it comes out as a text glyph, not an emoji.
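
One possible explanation, offered as a guess rather than a diagnosis: U+26A0 is a single code point in the Basic Multilingual Plane, so surrogate pairs are not involved here. Whether it is drawn as text or as an emoji is a rendering choice. Appending U+FE0F (VARIATION SELECTOR-16) requests emoji presentation, though the font or terminal decides whether to honor it. The variable names below are only for illustration.

```python
# -*- coding: utf-8 -*-
# u'\U000026A0' and u'\u26a0' are the same single code point (WARNING SIGN).
warning_text = u'\U000026A0'

# Assumption: the renderer supports emoji presentation sequences.
# U+FE0F (VARIATION SELECTOR-16) asks for the emoji-style glyph.
warning_emoji = warning_text + u'\ufe0f'

print(repr(warning_emoji))
```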