Writing and then reading a string in file encoded in latin1


Your data was written out as UTF-8:

>>> 'On écrit ça dans un fichier.'.encode('utf8').decode('latin1')
'On Ã©crit Ã§a dans un fichier.'
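
A minimal sketch of why each accented letter turns into two characters, assuming the text really was encoded as UTF-8 and decoded as Latin-1 exactly once. The variable names are only illustrative.

```python
text = 'On \u00e9crit \u00e7a dans un fichier.'

utf8_bytes = text.encode('utf8')      # each accented letter becomes two bytes
latin1_bytes = text.encode('latin1')  # each accented letter stays one byte
print(len(utf8_bytes), len(latin1_bytes))

# Decoding the UTF-8 bytes as Latin-1 maps every byte to its own character,
# so the garbled string is two characters longer than the original.
garbled = utf8_bytes.decode('latin1')
print(len(text), len(garbled))

# Latin-1 maps code points 0-255 straight back to bytes, so the damage
# can be undone by reversing the two steps.
print(garbled.encode('latin1').decode('utf8') == text)
```

Because the mapping is reversible, a string garbled this way can usually be repaired, but it is better to fix the encoding mismatch at its source.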

This means either that you did not write out Latin-1 data, or that your source code was saved as UTF-8 but you declared your script (using a PEP 263-compliant header) to be Latin-1 instead.

If you saved your Python script with a header like:

# -*- coding: latin-1 -*-

but your text editor saved the file with UTF-8 encoding instead, then the string literal:

s='On écrit ça dans un fichier.'

will be misinterpreted by Python as well, in the same manner. Saving the resulting unicode value to disk as Latin-1, then reading it again as Latin-1 will preserve the error.
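
The effect can be simulated without touching the editor settings. This is a sketch under the assumption that the source bytes are UTF-8 while Python decodes them as Latin-1; the file name `spam.txt` and the variable names are hypothetical.

```python
import os
import tempfile

# Bytes the editor actually saved (UTF-8) for the string literal.
source_bytes = 'On \u00e9crit \u00e7a dans un fichier.'.encode('utf8')
# What Python sees when the header declares Latin-1 instead.
s = source_bytes.decode('latin1')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'spam.txt')  # hypothetical file name
    with open(path, 'w', encoding='latin1') as f:
        f.write(s)
    with open(path, 'r', encoding='latin1') as f:
        read_back = f.read()
    with open(path, 'rb') as f:
        raw = f.read()

print(read_back == s)       # the round trip is lossless, so the error survives
print(raw == source_bytes)  # the file holds the UTF-8 bytes, not Latin-1 ones
```

The file I/O itself is consistent here; the string was already wrong before it was written.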

To debug, please take a close look at print(s.encode('unicode_escape')) in the first script. If it looks like:

b'On \\xc3\\xa9crit \\xc3\\xa7a dans un fichier.'

then your source code encoding and the PEP 263 header disagree on how the source code should be interpreted. If your source code is decoded correctly, the expected output is:

b'On \\xe9crit \\xe7a dans un fichier.'
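
The same check can be automated with a small heuristic. The function name `looks_misdecoded` is hypothetical, and the test only detects the specific case of UTF-8 bytes decoded as Latin-1; pure-ASCII strings and correctly decoded strings both report False.

```python
def looks_misdecoded(s: str) -> bool:
    """Heuristic: True if s looks like UTF-8 bytes that were decoded as Latin-1."""
    try:
        repaired = s.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or the bytes are not valid UTF-8.
        return False
    return repaired != s

good = 'On \u00e9crit \u00e7a dans un fichier.'
bad = good.encode('utf8').decode('latin1')

for label, value in (('good', good), ('bad', bad)):
    print(label, value.encode('unicode_escape'), looks_misdecoded(value))
```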

If Spyder is stubbornly ignoring the PEP 263 header and reading your source as Latin-1 regardless, avoid non-ASCII characters and use escape codes instead, either \uxxxx Unicode code points:

s = 'On \u00e9crit \u00e7a dans un fichier.'

or \xaa one-byte escape codes for code-points below 256:

s = 'On \xe9crit \xe7a dans un fichier.'
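
As a quick sanity check, the two escape forms can be compared directly. This sketch uses U+00E9 for é and U+00E7 for ç; the variable names are only illustrative.

```python
s_u = 'On \u00e9crit \u00e7a dans un fichier.'  # four-digit code point escapes
s_x = 'On \xe9crit \xe7a dans un fichier.'      # two-digit escapes, code points below 256

print(s_u == s_x)
# Encoded as Latin-1, each accented letter is a single byte.
print(s_u.encode('latin1'))
# The debugging view from earlier shows single \xe9 and \xe7 escapes.
print(s_u.encode('unicode_escape'))
```

Since the source file then contains only ASCII, the result no longer depends on how the editor or the interpreter decodes it.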
Author by

François Coulombeau

Maths and computer science teacher in french "Classes préparatoires aux grandes écoles".

Updated on June 14, 2022

Comments

  • François Coulombeau over 1 year

    Here are two Python 3 code samples. The first one writes two files with Latin-1 encoding:

    s = 'On écrit ça dans un fichier.'
    with open('spam1.txt', 'w', encoding='ISO-8859-1') as f:
        print(s, file=f)
    with open('spam2.txt', 'w', encoding='ISO-8859-1') as f:
        f.write(s)
    

    The second one reads the same files with the same encoding:

    with open('spam1.txt', 'r', encoding='ISO-8859-1') as f:
        s1 = f.read()
    with open('spam2.txt', 'r', encoding='ISO-8859-1') as f:
        s2 = f.read()
    

    Now, printing s1 and s2, I get

    On Ã©crit Ã§a dans un fichier.
    

    instead of the initial "On écrit ça dans un fichier."

    What is wrong? I also tried io.open, but I am missing something. The funny part is that I had no such problem with Python 2.7 and its str.decode method, which is now gone...

    Could someone help me?