Encoding characters with ISO 8859-1 in Python

21,268

Solution 1

When you're starting with a Unicode string, you need to encode rather than decode.

>>> def char_code(c):
...     return ord(c.encode('iso-8859-1'))
...
>>> print char_code(u'à')
224

For ISO-8859-1 in particular, you don't even need to encode it at all, since Unicode uses the ISO-8859-1 characters for its first 256 code points.

>>> print ord(u'à')
224
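
The overlap can be checked directly. The following is a minimal sketch, written to run on both Python 2 and Python 3; the helper name latin1_matches_unicode is made up for illustration.

```python
def latin1_matches_unicode():
    """Return True if every ISO-8859-1 byte decodes to the code point of the same value."""
    for value in range(256):
        # Build a one-byte string holding this value (works on Python 2 and 3).
        raw = bytes(bytearray([value]))
        char = raw.decode('iso-8859-1')
        if ord(char) != value:
            return False
    return True


result = latin1_matches_unicode()
```

If result is True, ord() on the Unicode character and ord() on its ISO-8859-1 encoding agree for all 256 values, which is why the encode step is optional here.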

Edit: I see the problem now. Your source code encoding comment declares the file to be ISO-8859-1, but I'll bet your editor is actually saving it as UTF-8. Python will then misinterpret the source: UTF-8 stores à as two bytes (0xC3 0xA0), and each byte is read as a separate ISO-8859-1 character, so the single-character string you think you created is actually two characters long. Try the following to see:

print len(u'à')

If your encoding is correct, it will print 1, but in your case it probably prints 2.
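
The mismatch can also be reproduced without touching any editor settings. This sketch encodes à as UTF-8 and then decodes those bytes as ISO-8859-1, which is roughly what happens when the coding comment disagrees with how the file was saved. The variable names are illustrative only.

```python
# u'\xe0' is à, written as an escape so the example does not depend on
# how this file itself is saved.
a_grave = u'\xe0'

# What a UTF-8 editor writes to disk: two bytes, 0xC3 0xA0.
utf8_bytes = a_grave.encode('utf-8')

# What results when those bytes are decoded as ISO-8859-1:
# each byte becomes its own character.
misread = utf8_bytes.decode('iso-8859-1')

# Encoding the misread string back to ISO-8859-1 yields two bytes,
# so ord() rejects it with a TypeError.
try:
    ord(misread.encode('iso-8859-1'))
    ord_failed = False
except TypeError:
    ord_failed = True
```

Here len(misread) is 2 and ord_failed ends up True, matching the "string of length 2" error in the question.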

Solution 2

ord() accepts any single character, not only ASCII. As you might expect, ord(u'💩') works fine, provided you can represent the character properly in your source, and/or read it in a known encoding.

Your error message suggests that the coding: iso-8859-1 declaration does not match the file, and that the file's real encoding is something else (UTF-8 or UTF-16 would be my guess).
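
One rough way to test that guess is to look at the raw bytes of the file. The sketch below is a heuristic, not a reliable detector; the function name guess_source_encoding and its return labels are invented for this example. It relies on the fact that non-ASCII text saved as ISO-8859-1 is rarely valid UTF-8.

```python
import codecs


def guess_source_encoding(raw):
    """Guess the encoding of raw file bytes (heuristic only)."""
    # A UTF-16 file normally starts with a byte order mark.
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8. Any byte sequence decodes as ISO-8859-1,
        # so this branch cannot confirm which single-byte encoding it is.
        return 'not-utf-8'
    if all(ord(ch) < 128 for ch in text):
        return 'ascii'
    return 'utf-8'
```

To apply it, read the script in binary mode and pass the bytes in. A result of 'utf-8' for a file that declares coding: iso-8859-1 would point to the mismatch described above.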

The canonical must-read on character encoding in Python is http://nedbatchelder.com/text/unipain.html

Author by

Drimades Boy

Updated on August 21, 2020

Comments

  • Drimades Boy
    Drimades Boy over 3 years

    With ord(ch) you can get a numerical code for character ch up to 127. Is there any function that returns a number from 0-255, so to cover also ISO 8859-1 characters?
    Edit: Here is my latest code and the error I get:

    #!/usr/bin/python
    # coding: iso-8859-1
    
    import sys
    reload(sys)
    sys.setdefaultencoding('iso-8859-1')
    print sys.getdefaultencoding()  # prints "iso-8859-1" 
    
    def char_code(c):
        return ord(c.encode('iso-8859-1'))
    print char_code(u'à')
    

    I get an error: TypeError: ord() expected a character, but string of length 2 found

  • Drimades Boy
    Drimades Boy over 8 years
    Using print char_code(u'💩') I get: Non-ASCII character '\xf0' in file unicode.py on line 4, but no encoding declared;
  • Rafael Telles
    Rafael Telles over 8 years
    This character does not exist in ISO-8859-1; check the table.
  • Rafael Telles
    Rafael Telles over 8 years
    And you should specify an encoding header.
  • tripleee
    tripleee over 8 years
    The error message suggests the coding: header is wrong. If you declare ISO-8859-1 encoding but the file is actually encoded as UTF-8 (or UTF-16), that's the error message you would expect.
  • tripleee
    tripleee over 8 years
    Maybe see the character-encoding tag wiki for some hints.
  • Drimades Boy
    Drimades Boy over 8 years
    I tried both ways you suggest, but I still get the same error.
  • Mark Ransom
    Mark Ransom over 8 years
    @DrimadesBoy then your example is incorrect, please update it with code that actually demonstrates the error.
  • Drimades Boy
    Drimades Boy over 8 years
    Solved. I'm using Geany on Ubuntu and changed the file encoding from 'utf-8' to 'iso-8859-1' via Document > Set Encoding > Western European > ISO-8859-1.
  • Mark Ransom
    Mark Ransom over 8 years
    @DrimadesBoy if it's solved, please use the checkbox so everybody knows it. And an upvote would be nice.