How to encode/decode this file in Python?

You are looking at byte string values, printed as repr() results because they are contained in a dictionary. Container contents are always represented with repr() to ease debugging. A repr() string can be re-used as a Python string literal, so non-printable and non-ASCII characters are shown as string escape sequences.
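As a small illustration of that point, the sketch below builds a one-entry dictionary by hand (the value is typed in as a bytes literal rather than read from a file, which is an assumption made for brevity) and checks that the dictionary's string form embeds the repr() of its value, escapes included:

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes for "Käse", written out as a literal for illustration.
raw = b'K\xc3\xa4se'
words = {'cheese': raw}

# str() of a container uses repr() for each key and value it holds.
shown = str(words)

# The escaped form of the value appears verbatim in the output.
assert repr(raw) in shown
# The two non-ASCII bytes show up as literal backslash escapes.
assert '\\xc3\\xa4' in shown
```

The bytes themselves are unchanged; only their display inside the container is escaped.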

Thus, the string 'K\xc3\xa4se' contains two non-ASCII bytes with hex values C3 and A4, which together form the UTF-8 encoding of the code point U+00E4 (ä).
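A quick way to convince yourself of this is to decode the bytes and inspect the result. This is only a sketch using a hand-written bytes literal; the variable names are illustrative:

```python
# -*- coding: utf-8 -*-
raw = b'K\xc3\xa4se'          # 5 bytes: K, C3, A4, s, e
text = raw.decode('utf-8')    # 4 characters: K, ä, s, e

assert len(raw) == 5
assert len(text) == 4
# The two bytes C3 A4 collapse into the single code point U+00E4.
assert text[1] == u'\xe4'
# Encoding back to UTF-8 reproduces the original bytes.
assert text.encode('utf-8') == raw
```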

You should decode the values to unicode objects:

words = {}
with open('dictionary.txt') as my_file:
    for line in my_file:   # just loop over the file
        if line.strip():   # ignoring blank lines
            # decode the UTF-8 bytes, then split on the first ':' only
            key, value = line.decode('utf8').strip().split(':', 1)
            words[key] = value

Or, better still, use codecs.open() to decode the file as you read it:

import codecs

words = {}
with codecs.open('dictionary.txt', 'r', 'utf8') as my_file:
    for line in my_file:   # lines are already decoded to unicode
        if line.strip():   # ignoring blank lines
            # split on the first ':' only
            key, value = line.strip().split(':', 1)
            words[key] = value
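For a version you can run without an existing dictionary.txt, here is a hedged, self-contained sketch. It swaps in io.open() (which also accepts an encoding argument and exists in both Python 2.6+ and Python 3) in place of codecs.open(), writes a small sample file to a temporary path, and reads it back. The load_words name and the sample contents are assumptions made for this example:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def load_words(path, encoding='utf-8'):
    """Read 'key:value' lines from path into a dict of decoded text."""
    words = {}
    with io.open(path, 'r', encoding=encoding) as my_file:
        for line in my_file:
            line = line.strip()
            if line:  # ignoring blank lines
                # Split on the first ':' only.
                key, value = line.split(u':', 1)
                words[key] = value
    return words


# Sample data standing in for dictionary.txt (includes a blank line).
sample = u'cat:Katze\n\ncheese:K\xe4se\nexercise:\xdcbung\n'

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    with io.open(path, 'w', encoding='utf-8') as out:
        out.write(sample)
    words = load_words(path)
finally:
    os.remove(path)
```

After this runs, words['cheese'] holds the four-character text value rather than the five UTF-8 bytes.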

Printing the resulting dictionary will still use repr() results for the contents, so now you'll see u'cheese': u'K\xe4se' instead, because \xe4 is the escape code for Unicode code point U+00E4, the ä character. Print individual words if you want the actual characters to be written to the terminal:

print words['cheese']

Now you can compare these values with other data that you have decoded (provided you know its correct encoding), manipulate them, and encode them again to whatever target codec you need. print does this encoding automatically, for example, when printing unicode values to your terminal.
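For the quiz game, that comparison might look roughly like the sketch below. The is_correct helper is hypothetical, and the answer bytes are written as literals here instead of being read from the terminal; the point is only that both sides are decoded before they are compared:

```python
# -*- coding: utf-8 -*-

def is_correct(answer_bytes, expected, encoding='utf-8'):
    """Decode raw input bytes and compare them with a decoded value."""
    return answer_bytes.decode(encoding) == expected


expected = u'K\xe4se'                       # decoded dictionary value
utf8_answer = b'K\xc3\xa4se'                # input from a UTF-8 terminal
latin1_answer = expected.encode('latin-1')  # same text, different codec

# Different bytes, same text once each is decoded with its own codec.
assert utf8_answer != latin1_answer
assert is_correct(utf8_answer, expected)
assert is_correct(latin1_answer, expected, encoding='latin-1')
```

Comparing the raw bytes directly would only work if both sides happened to use the same encoding.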

You may want to read up on Unicode and Python.

Author: skamsie

Updated on June 04, 2022

Comments

  • skamsie
    skamsie almost 2 years

    I am planning to make a little Python game that will randomly print keys (English) out of a dictionary, and the user has to input the value (in German). If the value is correct, it prints 'correct' and continues. If the value is wrong, it prints 'wrong' and breaks.

    I thought this would be an easy task, but I got stuck along the way. My problem is that I do not know how to print the German characters. Let's say I have a file 'dictionary.txt' with this text:

    cat:Katze
    dog:Hund
    exercise:Übung
    solve:lösen
    door:Tür
    cheese:Käse
    

    And I have this code just to test what the output looks like:

    # -*- coding: UTF-8 -*-
    words = {} # empty dictionary
    with open('dictionary.txt') as my_file:
      for line in my_file.readlines():
        if len(line.strip())>0: # ignoring blank lines
          elem = line.split(':') # split on ":"
          words[elem[0]] = elem[1].strip() # appending elements to dictionary
    print words
    

    Obviously the result of the print is not as expected:

        {'cheese': 'K\xc3\xa4se', 'door': 'T\xc3\xbcr',
         'dog': 'Hund', 'cat': 'Katze', 'solve': 'l\xc3\xb6sen',
         'exercise': '\xc3\x9cbung'}
    

    So where do I add the encoding and how do I do it?

    Thank you!