Python encoding/decoding problems
Solution 1
I mapped the most common problem characters, so this is a fairly complete answer, based on Oliver W.'s answer.
This function is by no means ideal, but it is a good place to start. More character definitions are listed here:
http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&names=-&utf8=string-literal
...
def unicodetoascii(text):
    # Python 2 only: keys are code points of Unicode punctuation (written as
    # UTF-8 byte sequences), values are the ASCII replacements.
    uni2ascii = {
        # quotation marks and apostrophes
        ord('\xe2\x80\x99'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
        ord('\xe2\x80\x9d'.decode('utf-8')): ord('"'),
        ord('\xe2\x80\x9e'.decode('utf-8')): ord('"'),
        ord('\xe2\x80\x9f'.decode('utf-8')): ord('"'),
        ord('\xe2\x80\x98'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\x9b'.decode('utf-8')): ord("'"),
        # accented letter
        ord('\xc3\xa9'.decode('utf-8')): ord('e'),
        # hyphens and dashes
        ord('\xe2\x80\x93'.decode('utf-8')): ord('-'),
        ord('\xe2\x80\x92'.decode('utf-8')): ord('-'),
        ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
        ord('\xe2\x80\x90'.decode('utf-8')): ord('-'),
        ord('\xe2\x80\x91'.decode('utf-8')): ord('-'),
        # primes
        ord('\xe2\x80\xb2'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb3'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb4'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb5'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb6'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb7'.decode('utf-8')): ord("'"),
        # superscript operators and parentheses
        ord('\xe2\x81\xba'.decode('utf-8')): ord("+"),
        ord('\xe2\x81\xbb'.decode('utf-8')): ord("-"),
        ord('\xe2\x81\xbc'.decode('utf-8')): ord("="),
        ord('\xe2\x81\xbd'.decode('utf-8')): ord("("),
        ord('\xe2\x81\xbe'.decode('utf-8')): ord(")"),
    }
    # Raises UnicodeEncodeError if any character outside the map is non-ASCII.
    return text.decode('utf-8').translate(uni2ascii).encode('ascii')

print unicodetoascii("weren\xe2\x80\x99t")
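The function above relies on Python 2's `str.decode` and the `print` statement. For Python 3, a rough equivalent might look like the sketch below. The function name `unicode_to_ascii`, the constant `UNI2ASCII`, and the shortened table are my own illustrative choices, and it assumes the input arrives as UTF-8 bytes.

```python
# Python 3 sketch: map common Unicode punctuation to ASCII equivalents.
# The table contents and the names used here are illustrative assumptions.
UNI2ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u00e9": "e",   # e with acute accent
})


def unicode_to_ascii(raw: bytes) -> str:
    """Decode UTF-8 bytes and replace the mapped characters with ASCII."""
    return raw.decode("utf-8").translate(UNI2ASCII)


print(unicode_to_ascii(b"weren\xe2\x80\x99t"))  # weren't
```

Characters that are not in the table pass through unchanged, so the table still has to be extended for whatever shows up in your data.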
Solution 2
In Python 3 I would do it like this:
string = "\xe2\x80\x9cThings"
bytes_string = bytes(string, encoding="raw_unicode_escape")
happy_result = bytes_string.decode("utf-8", "strict")
print(happy_result)
No translation maps needed, just code :)
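Note that the snippet above prints the curly quote (“Things), while the question asks for the plain ASCII one. If that matters, the byte round-trip can be combined with a small translation table. This is only a sketch: `repair` and `QUOTES` are made-up names, and it uses `latin-1` instead of `raw_unicode_escape`, which assumes every character in the input is below U+0100 (otherwise `encode` raises `UnicodeEncodeError`).

```python
# Python 3 sketch: repair a str whose characters are really UTF-8 byte values,
# then swap curly quotes for ASCII ones. Names here are illustrative.
QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def repair(text: str) -> str:
    # latin-1 turns each character below U+0100 into the byte of the same value.
    raw = text.encode("latin-1")
    # Those bytes are the real UTF-8 data; decode them, then replace the quotes.
    return raw.decode("utf-8").translate(QUOTES)


print(repair("\xe2\x80\x9cThings"))  # "Things
print(repair("weren\xe2\x80\x99t"))  # weren't
```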
Solution 3
You should provide a translation map that maps Unicode characters to other Unicode characters (the latter should be within the ASCII range if you want to re-encode to ASCII):
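uni2ascii = {ord('\xe2\x80\x99'.decode('utf-8')): ord("'")}
yourstring = "weren\xe2\x80\x99t"
# translate() returns a new string, so keep the result instead of discarding it
result = yourstring.decode('utf-8').translate(uni2ascii).encode('ascii')
print(result)  # prints: weren't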
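The remark about staying within the ASCII range matters because the final `encode` is strict by default: any non-ASCII character left out of the map raises an error. Here is a small Python 3 sketch of that behaviour and of the `errors` argument as a fallback. The sample text is made up for illustration.

```python
# Python 3 sketch: what happens to characters missing from the map.
uni2ascii = {ord("\u2019"): ord("'")}  # only the right single quote is mapped

text = "weren\u2019t caf\u00e9"        # sample input, an assumption for this demo
translated = text.translate(uni2ascii)

try:
    translated.encode("ascii")  # strict mode rejects the unmapped e-acute
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)

# Fallbacks: substitute a question mark, or drop the character entirely.
print(translated.encode("ascii", errors="replace").decode("ascii"))  # weren't caf?
print(translated.encode("ascii", errors="ignore").decode("ascii"))   # weren't caf
```

Both fallbacks lose information, so extending the map is usually the better fix when the character is one you care about.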
Brana
Updated on June 09, 2022
Comments
-
Brana almost 2 years
How do I decode strings such as "weren\xe2\x80\x99t" back to the normal encoding?
So this word is actually weren't and not "weren\xe2\x80\x99t"? For example:
print "\xe2\x80\x9cThings"
string = "\xe2\x80\x9cThings"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')

“Things
“Things
Things
But I actually want to get "Things.
or:
print "weren\xe2\x80\x99t"
string = "weren\xe2\x80\x99t"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')

weren’t
weren’t
werent
But I actually want to get weren't.
How should I do this?
-
Brana over 9 years
I know that I can do this. But is there a ready-made map that can do this automatically?
-
AKMalkadi over 2 years
I was looking for this answer!