Python - Encoding string - Swedish Letters

Solution 1

A solution to a lot of problems (Python 2 only):


Edit C:\Python??\Lib\site.py and replace "del sys.setdefaultencoding" with "pass".

Then put this at the top of your code:

import sys
sys.setdefaultencoding('latin-1')

The holy grail of fixing the Swedish/non-UTF8 compatible characters.
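
For comparison, here is a minimal sketch of the failure that changing the default encoding hides, and of encoding explicitly instead. The helper name to_bytes is hypothetical and only there for illustration; 'latin-1' is taken from the snippet above. A commenter below points out that sys.setdefaultencoding() was removed in Python 3, so the explicit form may be the more portable one.

```python
# -*- coding: utf-8 -*-
# Sketch only: encode explicitly instead of changing the interpreter-wide default.
text = u"Hur l\xe5ngt"  # u"Hur långt"


def to_bytes(value, encoding="latin-1"):
    """Hypothetical helper: encode text explicitly, replacing unmappable characters."""
    return value.encode(encoding, "replace")


# The ASCII codec cannot represent the Swedish letter, which is the error
# that shows up when Python 2 converts unicode to str implicitly.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Latin-1 has a single byte for the same letter.
latin1_bytes = to_bytes(text)
```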

Solution 2

You mention that an encoding error motivated you to write swedify in the first place, and that the solutions you found rely on chcp, which is a Windows command.

On *nix systems with UTF-8 terminals, swedify is not necessary:

>>> raw_input('Hur långt i kilometer är ditt mål: ')
Hur långt i kilometer är ditt mål: 100
'100'
>>> a = raw_input('Hur långt i kilometer är ditt mål: ')
Hur långt i kilometer är ditt mål: 200
>>> a
'200'

FWIW, when I do use swedify, I get the same error you do:

>>> def swedify(inp):
...     try:
...         return inp.decode('utf-8')
...     except:
...         return '(!Dec:) ' + str(inp)
... 
>>> swedify('Hur långt i kilometer är ditt mål: ') 
u'Hur l\xe5ngt i kilometer \xe4r ditt m\xe5l: '
>>> raw_input(swedify('Hur långt i kilometer är ditt mål: '))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 5: ordinal not in range(128)

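Your swedify function returns a unicode object. In Python 2, the built-in raw_input converts its prompt to a byte string with the default ASCII codec, so a unicode prompt containing non-ASCII characters raises UnicodeEncodeError, while a plain byte string is passed through unchanged: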

>>> raw_input("å")
åeee
'eee'
>>> raw_input(u"å")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128)
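
One way around this, sketched under the assumption that you know which encoding the console expects, is to encode the unicode prompt to a byte string yourself before handing it to raw_input. The helper name encode_prompt is hypothetical, and the encodings passed in below are only examples.

```python
# -*- coding: utf-8 -*-
# Sketch only: turn a unicode prompt into bytes so that Python 2's raw_input
# does not fall back to the ASCII codec.


def encode_prompt(prompt, encoding):
    """Hypothetical helper: encode a unicode prompt for the given console encoding.

    Characters the encoding cannot represent become '?' instead of raising.
    """
    return prompt.encode(encoding, "replace")


utf8_prompt = encode_prompt(u"m\xe5l: ", "utf-8")    # u"mål: " as UTF-8 bytes
ascii_prompt = encode_prompt(u"m\xe5l: ", "ascii")   # degraded but no exception

# Python 2 usage (not executed here):
#     answer = raw_input(encode_prompt(u"Hur l\xe5ngt ...: ", "utf-8"))
```

This only applies to Python 2. In Python 3, input() takes a text prompt directly, so the prompt should not be encoded there.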

You might want to try this in Python 3. See this Python bug.

Also of interest: How to read Unicode input and compare Unicode strings in Python?.

UPDATE According to this blog post there is a way to set the system's default encoding. This might be worth a try.

Solution 3

For me it worked fine with:

# -*- coding: utf-8 -*-
import sys

# Encoding the console uses for input
koden = sys.stdin.encoding

# raw_input needs a byte string, so encode the unicode prompt first
a = raw_input(u'Frågan är öppen? '.encode(koden))
print a

Per
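
A small caveat with the snippet above: sys.stdin.encoding can be None, for example when input is piped, and .encode(None) would then fail. Below is a minimal sketch of a fallback. The helper name stream_encoding and the 'utf-8' default are assumptions for illustration, not part of the original answer.

```python
# -*- coding: utf-8 -*-
import sys


def stream_encoding(stream, fallback="utf-8"):
    """Hypothetical helper: return the stream's encoding, or a fallback if it has none."""
    return getattr(stream, "encoding", None) or fallback


koden = stream_encoding(sys.stdin)

# Encode the prompt with whatever encoding was found; 'replace' avoids an
# exception if the console encoding cannot represent a character.
prompt_bytes = u"Fr\xe5gan \xe4r \xf6ppen? ".encode(koden, "replace")
```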

Solution 4

On Windows, the console's native Unicode support is broken. Even the apparent UTF-8 codepage isn't a proper fix.

To read from and write to the Windows console, you need to use https://github.com/Drekin/win-unicode-console, which works directly with the underlying console API so that multi-byte characters are read and written correctly.
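
A minimal sketch of switching it on, assuming the package is installed and that win_unicode_console.enable() is the entry point you want (check the project's README for the options it accepts). The wrapper name enable_unicode_console is hypothetical. It does nothing on other platforms or when the package is missing.

```python
import sys


def enable_unicode_console():
    """Hypothetical wrapper: enable win_unicode_console where it is available.

    Returns True if the console streams were patched, False otherwise.
    """
    if sys.platform != "win32":
        return False  # *nix terminals do not need this
    try:
        import win_unicode_console  # third-party package, assumed installed
    except ImportError:
        return False
    win_unicode_console.enable()
    return True


patched = enable_unicode_console()
```

Call this once at program start, before any raw_input or print. Python 3.6 and later use the Unicode console API on Windows themselves, so the package is mainly of interest for older interpreters.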

Author: Torxed

Updated on December 27, 2020

Comments

  • Torxed
    Torxed over 3 years

    I'm having some trouble with Python's raw_input command (Python 2.6). For some reason, raw_input does not accept the converted string that swedify() produces, and this gives me an encoding error. I'm aware of that kind of error; it's why I made swedify() to begin with. Here's what I'm trying to do:

    elif cmd in ('help', 'hjälp', 'info'):
        buffert += 'Just nu är programmet relativt begränsat,\nDe funktioner du har att använda är:\n'
        buffert += ' * historik :: skriver ut all din historik\n'
        buffert += ' * ändra <något> :: ändrar något i databasen, följande finns att ändra:\n'
        print swedify(buffert)
    

    This works just fine; it outputs the Swedish characters to the console just as I want. But when I try (in the same code, with the same \x?? values) to run this piece:

    core['goalDistance'] = raw_input(swedify('Hur långt i kilometer är ditt mål: '))
    core['goalTime'] = raw_input(swedify('Vad är ditt mål i minuter att springa ' +  core['goalDistance'] + 'km på: '))
    

    Then i get this:

    C:\Users\Anon>python löp.py
    Traceback (most recent call last):
      File "l÷p.py", line 92, in <module>
        core['goalDistance'] = raw_input(swedify('Hur långt i kilometer är ditt mål: '))
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 5: ordinal not in range(128)
    

    Now I've googled around and found some "solutions", but none of them work. Some said that I have to create a batch script that executes chcp ??? at the beginning, but that's not a clean solution IMO.

    Here is swedify:

    def swedify(inp):
        try:
            return inp.decode('utf-8')
        except:
            return '(!Dec:) ' + str(inp)
    

    Any solutions on how to get raw_input to read my return value from swedify()? I've tried from encodings import getencoder, getdecoder and others, but nothing for the better.

  • Torxed
    Torxed over 12 years
    Correct, on a *nix system this would be useless. Since my friends are not as enlightened as us lucky ones, they're using Windows 7 with different language packs and "default languages", which makes it tricky to get a good overall solution without 100 workarounds. As you mentioned, it does not take unicode strings, which I probably should have figured out. I sort of did, because I just moved the swedify() part out of the way and printed it alongside the raw_input, which wasn't all too pretty but it works. raw_input(u'åäö>'.encode('iso-8859-15')) sort of works, but gives odd letters.
  • Ray Toal
    Ray Toal over 12 years
    You should still be able to get things to work because Windows 7 should support UTF-8 for its console app. Remember that Python's raw_input uses the encoding of sys.stdin so if you can force that encoding to be UTF-8, and do the same for sys.stdout, will it work? Sorry I don't have a Windows 7 box to test this on.
  • Torxed
    Torxed over 12 years
    That will work. I remember seeing a solution where they used decode(encode(u'...')) with 'replace' somehow, but I can't find it; I know it solved a lot of problems. But forcing stdin will work, yes, so I'll mark the post as a solution. Windows is a work-around no matter what :) Cheers Ray!
  • Torxed
    Torxed over 12 years
    Just for the record, that doesn't help all too much. It only tells which encoding is expected within the file; it will not manage the actual output or input from, say, a socket or raw_input.
  • anarcat
    anarcat over 9 years
    sys.setdefaultencoding() is explicitly removed from Python 3 and said to be "evil" elsewhere: ziade.org/2008/01/08/syssetdefaultencoding-is-evil - please do not use it.
  • Alastair McCormack
    Alastair McCormack over 8 years
    It's the holy grail of bodges
  • Alastair McCormack
    Alastair McCormack over 8 years
    @RayToal, the Windows console does not support UTF-8. There's a codepage that looks like it supports UTF-8, but it's broken beyond belief and causes all kinds of issues, especially around reading multi-byte input.
  • Alastair McCormack
    Alastair McCormack over 8 years
    The usefulness of this codepage is limited. It has limited character support and does not fix reading multi-byte characters
  • Ray Toal
    Ray Toal over 8 years
    Good to know. But it is hard to believe that one of the world's most popular operating systems chose to have a native terminal (console) application that does not deal with what is arguably the world's most popular encoding of Unicode. So the company behind the O.S. is fine to just leave "console support" to volunteers in the open source community to build support over the Console API? (If so, that strikes me as an example of truth being stranger than fiction :) )