Convert Unicode to ASCII without errors in Python
Solution 1
2018 Update:
As of February 2018, compression such as gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and the Stack Exchange Network sites).
If you simply decode a gzipped response as in the original answer, you'll get an error similar to this:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte
To decode a gzipped response you need to import the following modules (in Python 3):
import gzip
import io
Note: In Python 2 you'd use StringIO instead of io.
Then you can parse the content out like this:
from urllib.request import urlopen # Python 3; in Python 2 use: from urllib2 import urlopen
response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource
This code reads the response and places the bytes in a buffer. The gzip module then reads the buffer through the GzipFile class. After that, the decompressed content can be read back as bytes and finally decoded into normal readable text.
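Not every server compresses its responses, so it can help to check before decompressing. The following is a minimal Python 3 sketch, not part of the original answer: decode_body is a hypothetical helper, and testing for the gzip magic bytes (0x1f 0x8b, the same 0x8b that shows up in the error above) is a heuristic I am assuming is good enough here.

```python
import gzip


def decode_body(raw, charset="utf-8"):
    """Return text from a response body that may be gzip-compressed.

    Hypothetical helper: `raw` is the bytes from response.read() and
    `charset` is the encoding you believe the resource uses.
    """
    # Gzip streams start with the two magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Simulate a gzipped body instead of making a network request.
compressed = gzip.compress("héllo wörld".encode("utf-8"))
print(decode_body(compressed))       # compressed input is decompressed first
print(decode_body(b"plain ascii"))   # uncompressed input passes straight through
```

Feeding uncompressed bytes straight into GzipFile is what produces a "Not a gzipped file" error, which this check avoids.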
Original Answer from 2010:
Can we get the actual value used for link?
In addition, we usually encounter this problem when we try to .encode() an already encoded byte string. So you might try to decode it first, as in:
html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
As an example:
html = '\xa0'
encoded_str = html.encode("utf8")
Fails with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
While:
html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
Succeeds without error. Do note that "windows-1252" is just something I used as an example. I got it from chardet, which had only 0.5 confidence that it is right (well, given a 1-character-length string, what do you expect). You should change it to the encoding that actually applies to the byte string returned from .urlopen().read().
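For comparison, here is the same decode-then-encode step as a small Python 3 sketch, where bytes and text are separate types and the implicit ASCII decode can no longer happen. As above, "windows-1252" is only an assumed source encoding.

```python
raw = b"\xa0"                        # bytes, as returned by .read()
text = raw.decode("windows-1252")    # bytes -> str (U+00A0, NO-BREAK SPACE)
utf8_bytes = text.encode("utf-8")    # str -> UTF-8 encoded bytes

print(utf8_bytes)                    # b'\xc2\xa0'
```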
Another problem I see there is that the .encode() string method returns the modified string and does not modify the source in place. So it's kind of useless to have self.response.out.write(html), as html is not the encoded string from html.encode (if that is what you were originally aiming for).
As Ignacio suggested, check the source webpage for the actual encoding of the string returned from read(). It's either in one of the meta tags or in the Content-Type header of the response. Then use that as the parameter for .decode().
Do note, however, that you should not assume other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before.)
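Both points, reading the declared charset and not trusting it blindly, can be sketched in Python 3 as follows. The names declared_charset and decode_html are hypothetical, and falling back to UTF-8 with errors="replace" is my assumption about a reasonable fallback, not something the answer prescribes.

```python
from email.message import Message


def declared_charset(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(failobj=default)


def decode_html(raw, content_type):
    """Decode bytes with the declared charset, falling back if it is wrong."""
    charset = declared_charset(content_type)
    try:
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # The declaration did not match the content, or names an unknown codec.
        return raw.decode("utf-8", errors="replace")


print(declared_charset("text/html; charset=ISO-8859-1"))
print(decode_html(b"caf\xe9", "text/html; charset=ISO-8859-1"))
```

With a real urlopen() response in Python 3, response.headers.get_content_charset() should give the same charset lookup without building a Message by hand.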
Solution 2
>>> u'aあä'.encode('ascii', 'ignore')
'a'
Decode the string you get back, using the charset given either in the appropriate meta tag in the response or in the Content-Type header, then encode.
The method encode(encoding, errors) accepts custom handlers for errors. Besides ignore, the built-in handlers include:
>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'a&#12354;&#228;'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'
See https://docs.python.org/3/library/stdtypes.html#str.encode
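To compare the handlers side by side, including the default strict behaviour, a short Python 3 sketch could look like this. The variable names are mine, and namereplace needs Python 3.5 or later.

```python
text = "aあä"

HANDLERS = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace", "namereplace")

# Encode the same text to ASCII once per error handler.
results = {name: text.encode("ascii", errors=name) for name in HANDLERS}
for name, encoded in results.items():
    print(name, encoded)

# The default handler, "strict", raises instead of substituting anything.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print("strict:", exc.reason)
```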
Solution 3
As an extension to Ignacio Vazquez-Abrams' answer
>>> u'aあä'.encode('ascii', 'ignore')
'a'
It is sometimes desirable to remove accents from characters and print the base form. This can be accomplished with
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'
You may also want to translate other characters (such as punctuation) to their nearest equivalents. For instance, the Unicode character RIGHT SINGLE QUOTATION MARK is not converted to an ASCII APOSTROPHE when encoding.
>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"
There are more efficient ways to accomplish this, though. For more details, see the question: Where is Python's "best ASCII for this Unicode" database?
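One way to combine the two steps, translating punctuation first and then stripping accents, is sketched below for Python 3. The mapping table is a small hand-picked sample rather than a complete database, and to_ascii is a hypothetical name.

```python
import unicodedata

# Hand-picked sample of punctuation with obvious ASCII equivalents.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",   # EN DASH
})


def to_ascii(text):
    """Map known punctuation, strip accents, then drop whatever is left."""
    text = text.translate(PUNCTUATION_MAP)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("It\u2019s a na\u00efve caf\u00e9"))   # It's a naive cafe
```

str.translate does all the replacements in one pass over the string, which avoids chaining one .replace() call per character.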
Solution 4
Use unidecode. It converts even weird characters to ASCII instantly, and even converts Chinese to phonetic ASCII.
$ pip install unidecode
then:
>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'
Solution 5
I use this helper function throughout all of my projects. If it can't convert the Unicode, it ignores it. This ties into a Django library, but with a little research you could bypass it.
from django.utils import encoding

def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')
I no longer get any unicode errors after using this.
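If you want a similar effect without Django, a minimal Python 3 sketch might look like the following. to_ascii_bytes and source_encoding are hypothetical names, and this is not Django's implementation of smart_str; it simply accepts either bytes or text and drops whatever cannot be represented in ASCII.

```python
def to_ascii_bytes(value, source_encoding="utf-8"):
    """Return ASCII bytes for `value`, silently dropping other characters.

    Hypothetical helper: `source_encoding` is only used when `value`
    arrives as bytes and has to be decoded first.
    """
    if isinstance(value, bytes):
        value = value.decode(source_encoding, errors="ignore")
    return str(value).encode("ascii", errors="ignore")


print(to_ascii_bytes("ni\xf1era"))                    # b'niera'
print(to_ascii_bytes("ni\xf1era".encode("utf-8")))    # b'niera'
```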
themirror
Updated on February 02, 2022
Comments
-
themirror almost 2 years
My code just scrapes a web page, then converts it to Unicode.
html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)
But I get a UnicodeDecodeError:
Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)
I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?
-
jar about 7 years
Seems like you may have encountered a "no break space" in the web page? It would need to be preceded by a c2 byte or you'd probably get a decode error: hexutf8.com/?q=C2A0
-
MRule about 2 years
The title of this question should be revised to indicate that it is specifically about parsing the result of an HTML request, and is not about "Converting Unicode to ASCII without errors in Python".
-
-
John Machin over 13 years
That is SUPPRESSING the problem, not diagnosing and fixing. It's like saying "After I cut my feet off, I no longer have problems with corns and bunions".
-
Gattster over 13 years
I agree it's suppressing the problem. It seems like that is what the question is after though. Look at his note: "Can I just drop whatever code bytes are causing the problem instead of getting an error?"
-
John Machin over 13 years
And the answer to that kind of question should be a resounding NO!! He already has an error, ignoring it is worse!!
-
Ajith Antony over 11 years
In your example I think you meant for the last line to be encoded_str = decoded_str.encode("utf8")
-
Joshua Burns about 11 years
This is exactly the same as simply calling "some-string".encode('ascii', 'ignore')
-
shanusmagnus about 10 years
I cannot tell you how tired I am of someone asking a question on SO, and getting all these preachy responses. "My car won't start." "Why do you want to start your car? You should walk instead." Stop it!
-
shanusmagnus about 10 years
Both helpful in addressing the question that was asked, and practical for addressing issues that might be underlying the asked question. This is a model answer for this kind of question.
-
Gattster almost 10 years
@user1244215 I totally agree with the point, but not your choice of diction in explaining said point.
-
Yablargo over 9 years
There are very real business cases in very real projects with very large dollars where, yes, it is absolutely OK to drop these characters.
-
Kerridge0 over 9 years
@JoshuaBurns No it's not the same
-
Joshua Burns over 9 years
@Kerridge0, just an objection? How about an explanation.
-
Kerridge0 over 9 years
@JoshuaBurns well, I tried it your way first before even finding this page, then I tried Gattster's way and it worked, then I read your comment and tried it your way again and it still didn't work. I could, I guess, look into the code and explain why it doesn't work, but you made the assertion that it is the same, so really that would be your responsibility I think.
-
Joshua Burns over 9 years
@Kerridge0, I made the comment a while ago so I'll have to re-familiarize myself with why I made the statement. I'll either correct my statement or try to explain why I believe or know them to be similar, although your comment makes me believe that there may be differences between the two now.
-
Gattster over 9 years
I would expect @JoshuaBurns solution to be the same as mine, but in practice it doesn't work. As others have said, "ignore" should mean "ignore" :)
-
Aurielle Perlmann over 7 years
halle-freakin-lujah - it's about time I found an answer that worked for me
-
Sarvesh over 7 years
Upvoted for fun value. Note that this mangles words in all accentuated languages. Škoda is not Skoda. Skoda most probably means something gross with eels and hovercrafts.
-
Dr Deo about 7 years
@shanusmagnus It's worse if the preachy guy is a moderator. He simply votes to close your question !! programmer racism
-
sajid over 6 years
This does not work when you have a non-ASCII character like ü in the string.
-
Oliver Zendel about 6 years
This also works for the (unstandardized) "extended ascii" cases
-
Stephen over 5 years
I've been scouring the internet for days until now.... thank you, thank you so much
-
Hyun-geun Kim about 4 years
I tried in Python 2.7.15, and I got this message: raise IOError, 'Not a gzipped file'. What did I do wrong?
-
Mark Ransom almost 4 years
@Gattster if you find ignore isn't working, it probably means Python is calling encode or decode invisibly on your behalf and it isn't your code failing at all. Python 3 no longer does that.
-
Mark Ransom almost 4 years
The coding comment is not a magic cure-all. You need to know why the error is being generated; this only fixes things when there are bad characters in your Python source. That doesn't appear to be the case for this question.
-
juanfal about 2 years
Ignoring chars is not a solution at all. It should be á → a, é → e, etc., as accented chars are not so important in, at least, Spanish; they are just a simple way to help you pronounce the words. You have to map the chars, as there is no solution to this either with iconv or with any other tool: i = orig.find(x); if i >= 0: x = dest[i], where orig is something like origL = 'áéíóúüç' and dest is destL = 'aeiouuc'
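The mapping described in this comment can be sketched in Python 3 with str.translate. The table covers only the characters listed in the comment, and strip_listed_accents is a hypothetical name.

```python
# Character-for-character mapping taken from the comment above.
ACCENT_TABLE = str.maketrans("áéíóúüç", "aeiouuc")


def strip_listed_accents(text):
    """Replace each listed accented character with its unaccented form."""
    return text.translate(ACCENT_TABLE)


print(strip_listed_accents("canción pingüino"))   # cancion pinguino
```

Characters outside the table, such as ñ, pass through unchanged, so the result is only ASCII if the table covers everything in the input.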