Python encoding/decoding problems

10,563

Solution 1

I mapped the most common odd characters, so this is a fairly complete answer, based on Oliver W.'s answer.

This function is by no means ideal, but it is a good place to start. More character definitions are listed here:

http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&names=-&utf8=string-literal

...

def unicodetoascii(text):

    uni2ascii = {
            ord('\xe2\x80\x99'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9d'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9e'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9f'.decode('utf-8')): ord('"'),
            ord('\xc3\xa9'.decode('utf-8')): ord('e'),
            ord('\xe2\x80\x93'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x92'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x98'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\x9b'.decode('utf-8')): ord("'"),

            ord('\xe2\x80\x90'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x91'.decode('utf-8')): ord('-'),

            ord('\xe2\x80\xb2'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb3'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb4'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb5'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb6'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb7'.decode('utf-8')): ord("'"),

            ord('\xe2\x81\xba'.decode('utf-8')): ord("+"),
            ord('\xe2\x81\xbb'.decode('utf-8')): ord("-"),
            ord('\xe2\x81\xbc'.decode('utf-8')): ord("="),
            ord('\xe2\x81\xbd'.decode('utf-8')): ord("("),
            ord('\xe2\x81\xbe'.decode('utf-8')): ord(")"),

    }
    return text.decode('utf-8').translate(uni2ascii).encode('ascii')

print unicodetoascii("weren\xe2\x80\x99t")  
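
The function above is Python 2 code (byte-string literals with .decode, print statement). On Python 3, where str is already Unicode, the same idea could be sketched roughly as follows. The names PUNCT_TO_ASCII and to_ascii_punct are made up for this illustration, and the table covers only a subset of the characters mapped above.

```python
# Hypothetical Python 3 sketch of the mapping idea above (subset of the table).
PUNCT_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2010": "-",   # hyphen
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u00e9": "e",   # e with acute accent
})


def to_ascii_punct(raw: bytes) -> str:
    """Decode UTF-8 bytes, then replace the mapped characters with ASCII ones."""
    text = raw.decode("utf-8").translate(PUNCT_TO_ASCII)
    # Raises UnicodeEncodeError if an unmapped non-ASCII character remains.
    text.encode("ascii")
    return text


print(to_ascii_punct(b"weren\xe2\x80\x99t"))  # prints: weren't
```

Raising on unmapped characters is a design choice in this sketch; it makes gaps in the table visible instead of silently dropping text.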

Solution 2

In Python 3 I would do it like this:

string = "\xe2\x80\x9cThings"
bytes_string = bytes(string, encoding="raw_unicode_escape")
happy_result = bytes_string.decode("utf-8", "strict")
print(happy_result)

No translation maps needed, just code :)
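
Note that this recovers the curly quote itself (U+201C), not an ASCII quote. A slightly more defensive variant might look like the sketch below. It assumes the input is a str whose characters are really UTF-8 byte values, uses latin-1 in place of raw_unicode_escape (the two should give the same bytes when every character is below U+0100), and returns the text unchanged when the round trip fails. The name fix_mojibake is hypothetical.

```python
def fix_mojibake(text: str) -> str:
    """Reinterpret a str whose characters are really UTF-8 byte values."""
    try:
        # latin-1 maps code points 0-255 to the same byte values.
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Not byte-valued, or not valid UTF-8: leave the text unchanged.
        return text


fixed = fix_mojibake("\xe2\x80\x9cThings")
print(fixed)                         # leading character is U+201C
print(fixed.replace("\u201c", '"'))  # swap in an ASCII double quote
```

If ASCII output is the goal, a replacement or translation step is still needed after the decode, as the last line shows.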

Solution 3

You should provide a translation map that maps Unicode characters to other Unicode characters (the targets must be within the ASCII range if you want to re-encode to ASCII):

yourstring = "weren\xe2\x80\x99t"
uni2ascii = {ord('\xe2\x80\x99'.decode('utf-8')): ord("'")}
yourstring = yourstring.decode('utf-8').translate(uni2ascii).encode('ascii')
print(yourstring)  # prints: weren't
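
As far as I know, the standard library does not ship a ready-made punctuation-to-ASCII map. One partial shortcut on Python 3 is to keep a small hand-written map for quotes and let unicodedata.normalize handle accented letters. This is a sketch of a different technique than the one above, under the assumption that silently dropping anything still non-ASCII is acceptable; QUOTES and asciify are hypothetical names.

```python
import unicodedata

# Small hand-written map; NFKD normalization leaves curly quotes unchanged.
QUOTES = {0x2018: "'", 0x2019: "'", 0x201C: '"', 0x201D: '"'}


def asciify(text: str) -> str:
    """Map curly quotes to ASCII, strip accents, drop what is still non-ASCII."""
    text = text.translate(QUOTES)
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # 'ignore' discards the combining marks and any other non-ASCII leftovers.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(asciify("weren\u2019t caf\u00e9"))  # prints: weren't cafe
```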
Author: Brana

Updated on June 09, 2022

Comments

  • Brana, almost 2 years

    How do I decode strings such as "weren\xe2\x80\x99t" back to normal text?

    The word is actually weren't, not "weren\xe2\x80\x99t". For example:

    print "\xe2\x80\x9cThings"
    string = "\xe2\x80\x9cThings"
    print string.decode('utf-8')
    print string.encode('ascii', 'ignore')
    
    “Things
    “Things
    Things
    

    But I actually want to get "Things.

    or:

    print "weren\xe2\x80\x99t"
    string = "weren\xe2\x80\x99t"
    print string.decode('utf-8')
    print string.encode('ascii', 'ignore')
    
    weren’t
    weren’t
    werent
    

    But I actually want to get weren't.

    How should I do this?

  • Brana, over 9 years
    I know that I can do this. But is there a ready-made map that can do this automatically?
  • AKMalkadi, over 2 years
    I was looking for this answer!