UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 388: surrogates not allowed
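Emojis in Unicode lie outside the Basic Multilingual Plane (BMP), which means they have codepoints that won't fit in 16 bits. Surrogate pairs are a way to make these characters directly representable in UTF-16 as a pair of 16-bit code units. Surrogates such as '\ud83d' are not characters in their own right, so Python's UTF-8 codec refuses to encode them, which is what the error is reporting.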
You can force surrogate pairs to be resolved into the corresponding codepoint outside the BMP like this:
"\ud83d\ude04".encode('utf-16','surrogatepass').decode('utf-16')
This will give you the codepoint \U0001f604. Note how it takes more than 4 hex digits to express.
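To make the mapping concrete, here is a minimal sketch of the arithmetic behind a surrogate pair, next to the encode/decode round trip shown above. The helper names combine_surrogates and resolve_surrogates are made up for illustration, and passing 'replace' to decode is an added assumption so that a stray, unpaired surrogate becomes a replacement character instead of raising.

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a UTF-16 high/low surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits of the code point's offset from 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def resolve_surrogates(text):
    """Merge surrogate pairs in a str into the real non-BMP characters.

    'surrogatepass' lets the surrogates through on encode; 'replace' on
    decode (an assumption, not part of the original answer) turns any
    unpaired surrogate into a replacement character.
    """
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


code_point = combine_surrogates(0xD83D, 0xDE04)
print(hex(code_point))  # 0x1f604

fixed = resolve_surrogates("\ud83d\ude04")
print(fixed == "\U0001f604")  # True
```

After this step the string contains a single real character, so encoding it as UTF-8 no longer fails.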
But this solution may only get you so far.
A lot of software (including pygame, older versions of IDLE, PowerShell, and the Windows command prompt) only supports the BMP, because it doesn't really use UTF-16 but its predecessor UCS-2, which is essentially UTF-16 without support for codepoints outside the BMP.
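If you have to hand text to BMP-only software like this, one option is to detect or substitute the characters it cannot handle before they get there. The following is only a sketch: has_non_bmp and replace_non_bmp are hypothetical helper names, and using U+FFFD as the placeholder is an arbitrary choice.

```python
def has_non_bmp(text):
    """Return True if any character lies outside the BMP (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def replace_non_bmp(text, placeholder="\ufffd"):
    """Swap every non-BMP character for a placeholder that fits in 16 bits."""
    return "".join(ch if ord(ch) <= 0xFFFF else placeholder for ch in text)


message = "hello \U0001f604"
if has_non_bmp(message):
    # Only needed for consumers limited to UCS-2 / the BMP.
    message = replace_non_bmp(message)
print(message.isprintable() or True)
```

This loses the emoji, so it is only worth doing when the target really cannot display it.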
When this answer was originally posted, in IDLE 3.7 and before, print ('\U0001f604') would just raise a UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001f604' in position 0: Non-BMP character not supported in Tk.
Python 3.8 finally fixed this and the fixes were backported to subsequent releases of Python 3.7, so in IDLE now, you can either provide the 17-bit codepoint:
print ('\U0001f604')
or transcode the UTF-16 surrogate pair to the same codepoint:
print ("\ud83d\ude04".encode('utf-16','surrogatepass').decode('utf-16'))
and both will print 😄.
What you still cannot do is print the UTF-16 surrogate pair as is: if you try print ("\ud83d\ude04") you will get the same \u escapes back.
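For the pandas case in the question, the same transcoding can be applied to every string cell before exporting. This is a rough sketch rather than a tested recipe for your data: the column names and sample rows are invented, resolve_surrogates is a hypothetical helper, and 'replace' on decode is an assumption that trades unpaired surrogates for replacement characters.

```python
import pandas as pd


def resolve_surrogates(value):
    """Merge surrogate pairs in strings; return non-string cells unchanged."""
    if not isinstance(value, str):
        return value
    return value.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


# Invented sample data; the first description holds a raw surrogate pair.
df = pd.DataFrame({
    "description": pd.Series(["happy \ud83d\ude04", "plain"], dtype=object),
    "score": [0.9, 0.1],
})

# Clean every column cell by cell; numeric cells pass through untouched.
cleaned = df.apply(lambda column: column.map(resolve_surrogates))

# With no path given, to_csv returns the CSV as a string.
csv_text = cleaned.to_csv(index=False)
csv_bytes = csv_text.encode("utf-8")  # no UnicodeEncodeError now
```

Once the cells are cleaned, cleaned.to_csv('sentiment_data.csv') or cleaned.to_excel(...) should encode without the surrogate error.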

Mohit Motwani
Updated on June 25, 2022
Comments
-
Mohit Motwani 6 months
When I try to use:
df[df.columns.difference(['pos', 'neu', 'neg', 'new_description'])].to_csv('sentiment_data.csv')
I get the error:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 388: surrogates not allowed
I don't understand what this error means and how I can fix this error and export my data to a csv/excel. I have referred to this question but I don't understand much and it doesn't answer how to do this with pandas.
What does position 388 mean? What is the character '\ud83d'?
I get a different error position when I try to export to an excel:
df[df.columns.difference(['pos', 'neu', 'neg', 'new_description'])].to_excel('sentiment_data_new.xlsx')
Error while exporting to excel:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 261: surrogates not allowed
Why is the position different when it's the same encoding?
The other duplicate questions don't answer how to escape this error with pandas DataFrame.
-
Mohit Motwani almost 4 years
Thank you for this explanation
-
ali reza 7 months
tnx a lot. that worked for me.