UTF-8 on Windows Python
Solution 1
UnicodeEncodeError suggests that the code fails while encoding Unicode text to bytes, i.e., your actual code tries to print to the Windows console. See Python, Unicode, and the Windows console.
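To see why the error points at encoding (output) rather than at reading the file, here is a minimal sketch. It assumes the console encoding behaves like cp1252; the sample string and the helper name try_encode are made up for illustration.

```python
# Hypothetical sample text: a BOM character followed by ASCII, similar to what
# you get after reading a BOM-prefixed UTF-8 file with encoding='utf-8'.
text = "\ufeffhello"


def try_encode(s, encoding="cp1252"):
    """Return the encoded bytes, or the UnicodeEncodeError that was raised."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


# Printing to a console implicitly performs this encoding step,
# which is where the failure happens.
result = try_encode(text)
print(type(result).__name__)

# An explicit error handler lets the encoding step succeed by escaping
# whatever the target encoding cannot represent.
safe = text.encode("cp1252", errors="backslashreplace")
print(safe)
```

Reading the file succeeded in both cases; only the conversion back to bytes for the console failed.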
The link above fixes UnicodeEncodeError. The next issue is to find out what character encoding is used by the text in your "path" file. If notepad.exe shows the text correctly, then it is either encoded using locale.getpreferredencoding(False) (something like cp1252 on Windows) or the file has a BOM.
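One way to check which of the two cases applies is to look at the first few bytes of the file. This is only a sketch: the function name guess_encoding is made up, and it only recognizes a BOM, falling back to the locale encoding otherwise (which is a guess, not a detection).

```python
import codecs
import locale

# BOM signatures. The UTF-32 entries come first because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw):
    """Return a codec name based on a leading BOM, else the locale encoding."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # No BOM: assume the locale's preferred encoding (e.g. cp1252 on Windows).
    return locale.getpreferredencoding(False)


# Typical use (the file name "path" is the placeholder from the question):
#     with open("path", "rb") as f:
#         encoding = guess_encoding(f.read(4))
print(guess_encoding(codecs.BOM_UTF8 + b"<html>"))
```

The codec names returned here ("utf-8-sig", "utf-16", "utf-32") all consume the BOM while decoding, so the result can be passed straight to open().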
If you are sure that the encoding is utf-8, then pass it to open() directly. Don't use codecs.open():
with open('path', encoding='utf-8') as file:
    html = file.read()
Sometimes, the input may contain text encoded using multiple (inconsistent) encodings, e.g., smart quotes may be encoded using cp1252 while the rest of the HTML is utf-8. You could fix it using bs4.UnicodeDammit. See also A good way to get the charset/encoding of an HTTP response in Python.
Solution 2
In anticipation of the OP updating the question to reflect the actual problem: the issue is caused by the terminal's encoding not being defined.
The Windows console is notoriously poor when it comes to Unicode support. For ultimate support, see https://pypi.python.org/pypi/win_unicode_console. Essentially, install win_unicode_console (pip install win_unicode_console). Then at the top of your code:
import win_unicode_console
win_unicode_console.enable()
You may also need to use a suitable font - See https://stackoverflow.com/a/5750227/1554386
As you're using an input with a UTF-8 BOM, you should use the utf_8_sig codec so that the BOM is stripped before working with the contents.
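The difference between the two codecs can be seen on a few bytes. A small sketch, using a made-up byte string in place of the real file contents:

```python
# Hypothetical file contents: the UTF-8 BOM (EF BB BF) followed by some HTML.
raw = b"\xef\xbb\xbf<html></html>"

# The plain utf-8 codec keeps the BOM as a leading U+FEFF character.
with_bom = raw.decode("utf-8")

# utf_8_sig strips the BOM if it is present.
without_bom = raw.decode("utf_8_sig")

print(repr(with_bom[0]))
print(without_bom)
```

utf_8_sig also decodes input that has no BOM, so it is safe to use when you are not sure whether one is there.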
As this is Python 3, you don't need to use the codecs module to set the encoding when using open().
Putting it together it would look like:
import win_unicode_console
win_unicode_console.enable()
infile = open("path", "r", encoding="utf_8_sig")
taspai
Updated on June 26, 2022
Comments
-
taspai almost 2 years
I have an HTML file to read, parse, etc.; it's encode on unicode (I saw it with the notepad), but when I tried
infile = open("path", "r")
infile.read()
it fails and I had the famous error :
UnicodeEncodeError: 'charmap' codec can't encode characters in position xx: character maps to <undefined>
So as a test, I copied the contents of the file into a new one, saved it as UTF-8, and then tried to open it with codecs like this:
inFile = codecs.open("path", "r", encoding="utf-8")
outputStream = inFile.read()
But I get this error message :
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
I really don't understand, because I created this file in UTF-8.
-
ShadowRanger over 8 years
As a side-note: Don't use codecs.open. On Py3, you can pass an encoding argument to regular open, and on Py2.7, you can import io.open (which is the same as Py3's built-in open) and do the same. codecs.open has some dumb quirks (e.g. doesn't do universal new line handling).
-
jfs over 8 years
It is best to avoid modifying the script. You could run it using the run module instead (a part of win-unicode-console): py -m run your-unicode-printing-script.py or, if it is appropriate in your case, put the win_unicode_console.enable() call into the sitecustomize or usercustomize modules.
-
roeland over 8 years
If Notepad says “Unicode” (as the OP said) it means UTF-16. The other encodings are usually called “ANSI” (cp1252 and friends) and “UTF-8” (which is UTF-8 with BOM).
-
jfs over 8 years
@roeland: yes. "it's encode on unicode (I saw it with the notepad)" from the question can be interpreted that way. The issue with that theory is that codecs.open("path", encoding='utf-8').read() returns u'\ufeff', i.e., utf-8-sig is more likely. 'utf-8' encoding fails for both BOM_UTF16_BE and BOM_UTF16_LE.
-
roeland over 8 years
Yeah, the question is a bit confusing as it involves two files, the original file in “Unicode”, and the file he re-saved as “UTF-8”.
-
jfs over 8 years
@roeland: anyway, the issue is UnicodeEncodeError, i.e., when OP tries to print Unicode text to the Windows console.
-
roeland over 8 years
Aha, I see. That was subtle.