Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
Solution 1
As per the documentation: This allows you to switch from the default ASCII to other encodings such as UTF-8, which the Python runtime will use whenever it has to decode a string buffer to unicode.
This function is only available at Python start-up time, when Python scans the environment. It has to be called in a system-wide module, sitecustomize.py. After this module has been evaluated, the setdefaultencoding() function is removed from the sys module.
The only way to actually use it is with a reload hack that brings the attribute back.
Also, the use of sys.setdefaultencoding() has always been discouraged. In py3k the default encoding is hard-wired to "utf-8": the function became a no-op there, changing the encoding raises an error, and current Python 3 releases no longer expose the function in the sys module.
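The Python 3 situation described above can be checked directly. This is a minimal sketch, assuming a Python 3 interpreter (the rest of this page is about Python 2); the variable names are only for illustration.

```python
import sys

# In Python 3 the default encoding is fixed and can only be read, not set.
default_encoding = sys.getdefaultencoding()

# The setter that Python 2 scripts bring back with reload(sys) is absent here.
has_setter = hasattr(sys, "setdefaultencoding")

print(default_encoding, has_setter)
```

Since there is nothing left to set, code that must run on both major versions has to guard any such call with a version check, which is one more reason to avoid it.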
I suggest some pointers for reading:
- http://blog.ianbicking.org/illusive-setdefaultencoding.html
- http://nedbatchelder.com/blog/200401/printing_unicode_from_python.html
- http://www.diveintopython3.net/strings.html#one-ring-to-rule-them-all
- http://boodebr.org/main/python/all-about-python-and-unicode
- http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python
Solution 2
tl;dr
The answer is NEVER! (unless you really know what you're doing)
9/10 times the problem can be resolved with a proper understanding of encoding/decoding.
1/10 people have an incorrectly defined locale or environment and need to set:
PYTHONIOENCODING="UTF-8"
in their environment to fix console printing problems.
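To see what the variable does without touching your own shell, a child interpreter can be started with PYTHONIOENCODING set. This is a rough sketch, assuming Python 3 and that sys.executable points at a usable interpreter; child_stdout_encoding is a hypothetical helper name, not part of any library.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and report its stdout encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    output = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    # Normalise spelling differences such as "UTF-8" versus "utf-8".
    return codecs.lookup(output.decode("ascii").strip()).name


reported = child_stdout_encoding("UTF-8")
print(reported)
```

The child's output is captured through a pipe, so this is the same redirected situation in which a missing console encoding usually bites.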
What does it do?
sys.setdefaultencoding("utf-8") changes the default encoding/decoding used whenever Python 2.x needs to convert a unicode() to a str() (and vice-versa) and the encoding is not given. I.e.:
str(u"\u20AC")
unicode("€")
"{}".format(u"\u20AC")
In Python 2.x, the default encoding is set to ASCII and the above examples will fail with:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
(My console is configured as UTF-8, so "€" = '\xe2\x82\xac', hence exception on \xe2)
or
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
sys.setdefaultencoding("utf-8") will allow these to work for me, but won't necessarily work for people who don't use UTF-8. The default of ASCII ensures that assumptions of encoding are not baked into code.
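The alternative to relying on a default is to name the encoding at every conversion. This is a small sketch, assuming Python 3, where text and bytes never convert implicitly; outcome is a hypothetical helper used only to capture the error names.

```python
euro_text = u"\u20ac"                    # the character €
euro_bytes = euro_text.encode("utf-8")   # explicit encode: b'\xe2\x82\xac'


def outcome(operation):
    """Run operation() and return its result, or the name of the Unicode error raised."""
    try:
        return operation()
    except UnicodeError as exc:
        return type(exc).__name__


# The same two failures quoted above, triggered by naming ASCII explicitly.
ascii_decode = outcome(lambda: euro_bytes.decode("ascii"))
ascii_encode = outcome(lambda: euro_text.encode("ascii"))

# Naming the real encoding makes the conversion work for everyone, whatever their locale.
utf8_round_trip = outcome(lambda: euro_bytes.decode("utf-8"))

print(ascii_decode, ascii_encode, utf8_round_trip == euro_text)
```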
Console
sys.setdefaultencoding("utf-8") also has a side effect of appearing to fix sys.stdout.encoding, used when printing characters to the console. Python uses the user's locale (Linux/OS X/Un*x) or codepage (Windows) to set this. Occasionally, a user's locale is broken and just requires PYTHONIOENCODING to fix the console encoding.
Example:
$ export LANG=en_GB.gibberish
$ python
>>> import sys
>>> sys.stdout.encoding
'ANSI_X3.4-1968'
>>> print u"\u20AC"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
>>> exit()
$ PYTHONIOENCODING=UTF-8 python
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print u"\u20AC"
€
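The session above comes down to which encoding the output stream was created with. This is a sketch, assuming Python 3 and using an in-memory stream as a stand-in for the console; try_print is a hypothetical helper name.

```python
import io


def try_print(text, encoding):
    """Write text to an in-memory stream that encodes with the given encoding."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError as exc:
        return "failed: %s" % exc.reason
    return raw.getvalue()


ascii_result = try_print(u"\u20ac", "ascii")   # mimics the broken locale
utf8_result = try_print(u"\u20ac", "utf-8")    # mimics PYTHONIOENCODING=UTF-8

print(ascii_result)
print(utf8_result)
```

Nothing about the text changes between the two calls; only the stream's encoding does, which is why fixing the environment is enough.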
What's so bad with sys.setdefaultencoding("utf-8")?
People have been developing against Python 2.x for 16 years on the understanding that the default encoding is ASCII. UnicodeError exception handling methods have been written to handle string to Unicode conversions on strings that are found to contain non-ASCII.
From https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/
def welcome_message(byte_string):
    try:
        return u"%s runs your business" % byte_string
    except UnicodeError:
        # detect_encoding() is a placeholder for an encoding guesser, not a real library call
        return u"%s runs your business" % unicode(byte_string,
            encoding=detect_encoding(byte_string))

print(welcome_message(u"Angstrom (Å®)".encode("latin-1")))
Previous to setting defaultencoding this code would be unable to decode the “Å” in the ascii encoding and then would enter the exception handler to guess the encoding and properly turn it into unicode. Printing: Angstrom (Å®) runs your business. Once you’ve set the defaultencoding to utf-8 the code will find that the byte_string can be interpreted as utf-8 and so it will mangle the data and return this instead: Angstrom (Ů) runs your business.
Changing what should be a constant will have dramatic effects on modules you depend upon. It's better to just fix the data coming in and out of your code.
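The mangling in the quoted example can be reproduced without touching the interpreter's settings. This is a sketch, assuming Python 3; the default_encoding parameter stands in for the process-wide default, and fallback_encoding stands in for the hypothetical detect_encoding() of the quoted code.

```python
def welcome_message(byte_string, default_encoding="ascii", fallback_encoding="latin-1"):
    """Decode with the 'default' first and fall back only when that raises."""
    try:
        name = byte_string.decode(default_encoding)
    except UnicodeError:
        # Stand-in for detect_encoding(): here we simply know the data is Latin-1.
        name = byte_string.decode(fallback_encoding)
    return u"%s runs your business" % name


data = u"Angstrom (\u00c5\u00ae)".encode("latin-1")   # "Angstrom (Å®)" as Latin-1 bytes

with_ascii_default = welcome_message(data)
with_utf8_default = welcome_message(data, default_encoding="utf-8")

print(with_ascii_default)   # fallback ran, text is intact
print(with_utf8_default)    # no error was raised, so the fallback never ran
```

The two Latin-1 bytes 0xC5 0xAE happen to form one valid UTF-8 sequence, so the UTF-8 attempt succeeds silently and yields a different character instead of an error.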
Example problem
While the setting of defaultencoding to UTF-8 isn't the root cause in the following example, it shows how problems are masked and how, when the input encoding changes, the code breaks in an unobvious way: UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte
Solution 3
#!/usr/bin/env python
#-*- coding: utf-8 -*-
u = u'moçambique'
print u.encode("utf-8")
print u
chmod +x test.py
./test.py
moçambique
moçambique
./test.py > output.txt
Traceback (most recent call last):
File "./test.py", line 5, in <module>
print u
UnicodeEncodeError: 'ascii' codec can't encode character
u'\xe7' in position 2: ordinal not in range(128)
It works when run in the shell but fails when stdout is redirected, so setting the default encoding is one workaround for writing to stdout.
I took another approach: the script refuses to run if sys.stdout.encoding is not defined. In other words, you need to export PYTHONIOENCODING=UTF-8 first to write to stdout.
import sys
if sys.stdout.encoding is None:
    # Redirected output has no encoding in Python 2 unless PYTHONIOENCODING is set
    print >> sys.stderr, "Please set PYTHONIOENCODING=UTF-8 (for example: export PYTHONIOENCODING=UTF-8) when writing to stdout."
    sys.exit(1)
so, using same example:
export PYTHONIOENCODING=UTF-8
./test.py > output.txt
will work
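Another option is to encode explicitly and write bytes, so the result does not depend on whether the output is a terminal or a file. This is a sketch, assuming Python 3; write_line_utf8 is a hypothetical helper, and an in-memory buffer stands in for the binary stream that a real script would take from sys.stdout.buffer.

```python
import io


def write_line_utf8(binary_stream, text):
    """Encode text as UTF-8 ourselves and write the bytes plus a newline."""
    binary_stream.write(text.encode("utf-8") + b"\n")


# In a real script the target would be sys.stdout.buffer; a BytesIO keeps the sketch self-contained.
captured = io.BytesIO()
write_line_utf8(captured, u"mo\u00e7ambique")   # "moçambique"

print(captured.getvalue())
```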
Solution 4
-
The first danger lies in reload(sys). When you reload a module, you actually get two copies of the module in your runtime. The old module is a Python object like everything else, and stays alive as long as there are references to it. So, half of the objects will be pointing to the old module, and half to the new one. When you make some change, you will never see it coming when some random object doesn't see the change:
(This is IPython shell)
In [1]: import sys
In [2]: sys.stdout
Out[2]: <colorama.ansitowin32.StreamWrapper at 0x3a2aac8>
In [3]: reload(sys)
<module 'sys' (built-in)>
In [4]: sys.stdout
Out[4]: <open file '<stdout>', mode 'w' at 0x00000000022E20C0>
In [11]: import IPython.terminal
In [14]: IPython.terminal.interactiveshell.sys.stdout
Out[14]: <colorama.ansitowin32.StreamWrapper at 0x3a9aac8>
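The stale-reference effect is easier to observe on a throwaway module than on sys. This is a sketch, assuming Python 3 and importlib.reload; the module name demo_settings and its Config class are invented for the illustration.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a tiny module on disk so that it can be imported and reloaded.
    with open(os.path.join(tmp, "demo_settings.py"), "w") as handle:
        handle.write("class Config(object):\n    pass\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        import demo_settings
        old_config = demo_settings.Config      # a reference taken before the reload
        importlib.reload(demo_settings)
        new_config = demo_settings.Config      # what code sees after the reload
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("demo_settings", None)

# Objects captured earlier still point at the pre-reload definitions.
same_class = old_config is new_config
old_instance_matches_new_class = isinstance(old_config(), new_config)

print(same_class, old_instance_matches_new_class)
```

Anything that grabbed a reference before the reload keeps using the old object, which is the mismatch the IPython session shows for sys.stdout.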
-
Now, sys.setdefaultencoding() proper. All that it affects is implicit conversion str<->unicode. Now, utf-8 is the sanest encoding on the planet (backward-compatible with ASCII and all), the conversion now "just works", what could possibly go wrong? Well, anything. And that is the danger.
- There may be some code that relies on the UnicodeError being thrown for non-ASCII input, or does the transcoding with an error handler, which now produces an unexpected result. And since all code is tested with the default setting, you're strictly on "unsupported" territory here, and no-one gives you guarantees about how their code will behave.
- The transcoding may produce unexpected or unusable results if not everything on the system uses UTF-8 because Python 2 actually has multiple independent "default string encodings". (Remember, a program must work for the customer, on the customer's equipment.)
- Again, the worst thing is you will never know that because the conversion is implicit -- you don't really know when and where it happens. (Python Zen, koan 2 ahoy!) You will never know why (and if) your code works on one system and breaks on another. (Or better yet, works in IDE and breaks in console.)
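The point about several independent "default" encodings can be inspected on any machine. This is a sketch, assuming Python 3, where the same query functions exist; the values are not shown here because they depend on the platform and locale.

```python
import locale
import sys

# Each of these is configured separately and may differ from the others.
encodings = {
    "sys.getdefaultencoding()": sys.getdefaultencoding(),
    "sys.getfilesystemencoding()": sys.getfilesystemencoding(),
    "locale.getpreferredencoding(False)": locale.getpreferredencoding(False),
    "sys.stdout.encoding": getattr(sys.stdout, "encoding", None),
}

for name, value in sorted(encodings.items()):
    print("%-36s %s" % (name, value))
```

Running it in an IDE, in a terminal and with redirected output is a quick way to see why code can work in one place and break in another.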
Updated on August 22, 2020
Comments
-
mlzboy about 3 years
I have seen a few py scripts which use this at the top of the script. In what cases should one use it?
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
-
seanv507 over 8 years
There is a problem with using this in IPython: %time stops working. github.com/ipython/ipython/issues/8071
-
Alastair McCormack almost 8 years
@seanv507, read the answers - using it is seriously discouraged
-
idbrii almost 8 years
-
smci almost 6 years
How is this not an exact duplicate of Dangers of sys.setdefaultencoding('utf-8')? This question (2010) predates that one (2015), but that question has good answers too. What to do? Also, to be clear, this question only makes sense on Python 2 not 3, yet that's nowhere tagged or mentioned.
-
ccpizza almost 4 years
Worth reading before diving into SO answers: pythonhosted.org/kitchen/unicode-frustrations.html
-
-
mbb almost 11 years
Great stuff, though there's a bit of death by too much information here. I learned the most just focusing on this article: blog.notdot.net/2010/07/Getting-unicode-right-in-Python
-
Bruno Feroleto over 10 years
I would like to add that the default encoding is also used for encoding (when writing to sys.stdout when it has a None encoding, like when redirecting the output of a Python program).
-
jfs over 9 years
+1 for "the use of sys.setdefaultencoding() has always been discouraged"
-
Tino about 8 years
'hard-wired to utf-8' is not true, it's not hardwired and it's not always UTF-8. LC_ALL=en_US.UTF-8 python3 -c 'import sys; print(sys.stdout.encoding)' gives UTF-8 but LC_ALL=C python3 -c 'import sys; print(sys.stdout.encoding)' gives ANSI_X3.4-1968 (or perhaps something else)
-
Alastair McCormack almost 8 years
@Tino, console encoding is separate to default encoding.
-
Tino almost 8 years
@AlastairMcCormack Thank you for correcting me, this made me aware that stdin/stdout/stderr are now (it was completely different before) independent from sys.getdefaultencoding() and setdefaultencoding only accepts 'utf-8'. So please ignore my first sentence, but the rest of my comment still might help others to not fall into the same trap.
-
Alastair McCormack almost 8 years
@Tino, it's a complex and poorly documented part of the language. The more that's written about it the better :)
-
Yongwei Wu over 6 years
While there are surprises in sys.setdefaultencoding("utf-8"), it is good to make the code behave more like Python 3. It is 2017 now. Even when you wrote the answer back in 2015, I think it was already better to look forward instead of backward. It was actually the simplest solution for me, when I found my code behaved differently in Python 2 depending on whether the output is redirected (very nasty problem for Python 2). Needless to say, I already have # coding: utf-8, and I do not need any workarounds for Python 3 (I actually have to mask the setdefaultencoding using a version check).
-
Alastair McCormack over 6 years
That's great and it works for you, but sys.setdefaultencoding("utf-8") does not make your Py 2.x code compatible with Python 3. Nor does it fix external modules that assume the default encoding is ASCII. Making your code Python 3 compatible is very simple and doesn't require this nasty hack. For an example of why this causes very real problems, see my experience with Amazon messing with this assumption: stackoverflow.com/questions/39465220/…
-
sam almost 6 years
@AlastairMcCormack you rock, My site has been since months and could not figure out what to do. Finally, PYTHONIOENCODING="UTF-8" helped my Python2.7 Django-1.11 environment. Thanks.
-
dlamblin over 5 years
I know you copied the example, but I can't find what package has detect_encoding.
-
Alastair McCormack over 5 years
@dlamblin The code example is to prove the quote and is not supposed to be used in your code. Imagine that detect_encoding is a method that could detect the encoding of a string based on language clues.
-
ivan_pozdeev over 5 years
This doesn't really answer the question. "Discouraged" is not an argument.
-
ivan_pozdeev about 5 years
This doesn't answer the question as asked. Rather some tangential thoughts on the subject.
-
Tim Bird over 3 years
This is a great answer that adds more clarity to the possible issues. Unfortunately setting PYTHONIOENCODING didn't work for me. I never could figure out why not. In my case, a module was using str() on some text it received from a server. There was no way for me to control that use of str(), so I was stuck. All the answers I have seen amount to "control your inputs and outputs", but that control was outside of my code.
-
Alastair McCormack over 3 years
@TimBird do you want a hand with further investigation? Could it be that your encoding is UTF-8 but the text received from the remote server isn't UTF-8 at all?
-
DimeCadmium almost 2 years
@AlastairMcCormack in other words "Imagine that detect_encoding is a magical method, which doesn't exist in reality, and will tell you the encoding of a string even when misconfigured". The only practical implementation of this would involve (a) application-specific heuristics inappropriate for a library; (b) separately transmitting an encoding (not always within control), or (c) just guessing at a default or fallback encoding - which is exactly what sys.setdefaultencoding does. As far as Py2/3 compat, it is NOT simple, especially for IO-focused software which needs significant mods to work.
-
Alastair McCormack almost 2 years
@DimeCadmium it should be perfectly simple if the original developer of the code understood text encoding and didn't ambiguously encode and decode in-memory bytes/strs. Nothing changed in Py3 except Py3 made it harder to mess up (by working more like Java and C#). As Py 2.x has Unicode and bytes, there should be little issue keeping Py2 and Py3 code compatible. Remember to use the Python 2.x io package (directly or via six) to simplify your actual IO. Of course, Py2 is now obsolete, so most of this is moot. Always happy to help. Also, look at Chardet for detect_encoding magic.
-
DimeCadmium almost 2 years
@AlastairMcCormack here in reality you usually aren't in control of the other software which produces/consumes the things you consume/produce. There are far more differences between Py2 and Py3 than just string types.