Python - Reading Emoji Unicode Characters
Solution 1
I don't think you're using encode correctly, nor do you need to. What you have is a valid unicode string with one 4-digit and one 8-digit escape sequence. Try this in the REPL on, say, OS X:
>>> s = u'that\u2019s \U0001f63b'
>>> print s
that’s 😻
In Python 3, though:
Python 3.4.3 (default, Jul 7 2015, 15:40:07)
>>> s = u'that\u2019s \U0001f63b'
>>> s[-1]
'😻'
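As for where the 0001 comes from, it may help to look at the code points directly. The following is a minimal Python 3 sketch (the variable names are only illustrative): the emoji's code point is U+1F63B, not U+F63B, and because the \U escape always takes exactly 8 hex digits, it is written zero-padded as \U0001f63b.

```python
# Python 3: each element of a str is one full code point.
s = 'that\u2019s \U0001f63b'

# \uXXXX takes exactly 4 hex digits; \UXXXXXXXX takes exactly 8.
# U+1F63B needs 5 hex digits, so it must use the 8-digit form,
# padded with leading zeros: \U0001f63b.
code_points = [hex(ord(c)) for c in s]
print(code_points)

# The last element is the whole emoji, not half of a surrogate pair.
last_code_point = ord(s[-1])
print(len(s), hex(last_code_point))
```

So there is nothing to trash: the 1 is part of the code point itself, and the 000 is only padding required by the escape syntax.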
Solution 2
Your remaining confusion is likely due to the fact that you are running what is called a "narrow Python build". On a narrow build, a single unicode character is 16 bits wide, which is not enough to hold a code point above U+FFFF such as this emoji, so the emoji is stored as two code units (a UTF-16 surrogate pair). The best solution would be to move to Python 3. Otherwise, process the UTF-16 surrogate pair yourself.
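If you do have to process the surrogate pair by hand, the arithmetic looks roughly like this. This is only a sketch under some assumptions: it is written for Python 3, it works on a list of UTF-16 code units given as integers, and `combine_surrogates` is a hypothetical helper name, not a standard library function.

```python
def combine_surrogates(units):
    """Combine UTF-16 code units (ints) into full Unicode code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            # High surrogate carries the top 10 bits, low surrogate the
            # bottom 10 bits, offset by 0x10000.
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # Ordinary BMP code unit (or an unpaired surrogate): keep as-is.
            code_points.append(unit)
            i += 1
    return code_points


# 's', ' ', then the pair \ud83d \ude3b from the question's output.
units = [0x73, 0x20, 0xD83D, 0xDE3B]
result = combine_surrogates(units)
print([hex(cp) for cp in result])
```

With the pair from the question, 0xD83D and 0xDE3B combine back into 0x1F63B, which is the single code point that \U0001f63b names.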
Andrew LaPrise
Updated on July 22, 2022
Comments
-
Andrew LaPrise almost 2 years
I have a Python 2.7 program which reads iOS text messages from a SQLite database. The text messages are unicode strings. In the following text message:
u'that\u2019s \U0001f63b'
The apostrophe is represented by \u2019, but the emoji is represented by \U0001f63b. I looked up the code point for the emoji in question, and it's \uf63b. I'm not sure where the 0001 is coming from. I know comically little about character encodings.
When I print the text, character by character, using:
s = u'that\u2019s \U0001f63b'
for c in s:
    print c.encode('unicode_escape')
The program produces the following output:
t h a t \u2019 s \ud83d \ude3b
How can I correctly read these last characters in Python? Am I using encode correctly here? Should I just attempt to trash those 0001s before reading it, or is there an easier, less silly way?
-
Andrew LaPrise almost 9 years
Well would ya look at that... I really know nothing about nothing. Thanks! I'm still not clear how to read just that last character though. s[-1] and s[-2] still give '\ud83d' and '\ude3b'. Is there a way to read the string character by character?
-
Mark Ransom almost 9 years
@alaprise you're seeing an artifact of the way Python stores its Unicode strings internally. If you did the same thing in Python 3 you'd see something different entirely.
-
pvg almost 9 years
@alaprise The other answer has some good info, of which the summary is 'if possible, move to Python 3'. Otherwise you're entering a world of pain/surrogate pairs/words you don't want to know for they are the song of Cthulhu.
-
roeland almost 9 years
'\ud83d' and '\ude3b' are a surrogate pair, used by UTF-16 to represent a code point above U+FFFF. This is a bug in Python 2; a lot of languages have that problem with those characters.
-
jfs almost 9 years
@roeland: s[-1] == u'\U0001f63b' on both Python 2 and 3 on my machine ("wide Python builds" have been supported since 2001).
-
jfs almost 9 years
@alaprise: see "How to install python on Mac with wide-build"
-
jfs almost 9 years
regex.findall(r'\X', unicode_text) could be used to get "user-perceived characters" that may span more than one Unicode code point (it is unrelated to surrogate pairs, but it should fix the issue as a side effect).
-
Jeef over 7 years
I can't get this working with the warning sign: u'\U000026A0' - it comes out as a text glyph, not an emoji.
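One possible explanation for the warning sign, offered as an assumption rather than a diagnosis of that particular setup: U+26A0 has text presentation by default, and Unicode lets you request the emoji style by appending the variation selector U+FE0F. A short Python 3 sketch (the constant names are made up for illustration):

```python
WARNING_SIGN = '\u26a0'    # U+26A0 WARNING SIGN, text presentation by default
EMOJI_SELECTOR = '\ufe0f'  # U+FE0F VARIATION SELECTOR-16

# The selector follows the character whose presentation it modifies.
emoji_warning = WARNING_SIGN + EMOJI_SELECTOR
print([hex(ord(c)) for c in emoji_warning])
```

Whether the glyph actually changes still depends on the font and on the terminal or app that renders it, so this may not help everywhere.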