Python3: Convert Latin-1 to UTF-8

python encoding utf-8 python-3.5

12,166

Solution 1

I have found a partial workaround. It is not exactly what you want or need, but it might point others in the right direction:

# First read the file (it is decoded as Latin-1 here)
with open("file_name", "r", encoding="latin-1") as txt:  # r = read, w = write, a = append
    items = txt.readlines()

# Then write the changes back to the same file as UTF-8
with open("file_name", "w", encoding="utf-8") as output:
    for string_fin in items:
        # Replace each mis-decoded sequence with the intended character
        string_fin = string_fin.replace("Ã©", "é")
        string_fin = string_fin.replace("Ã«", "ë")

        # This works if not too much needs changing...

        output.write(string_fin)

*note for detection
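
The per-character replacements above can be generalized. As a minimal sketch, assuming the text is UTF-8 that was mistakenly decoded as Latin-1 (which is what sequences such as "Ã©" usually indicate), re-encoding as Latin-1 recovers the original bytes, and decoding those bytes as UTF-8 restores every affected character at once. The helper name fix_mojibake is hypothetical and not part of the original answer.

```python
def fix_mojibake(text):
    """Undo UTF-8-read-as-Latin-1 damage; return text unchanged if that does not apply."""
    try:
        # Latin-1 maps each code point below 256 to one byte, so this
        # recovers the bytes that were originally misinterpreted.
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the text has characters outside Latin-1, or the recovered
        # bytes are not valid UTF-8: assume the text was not damaged this way.
        return text


print(fix_mojibake("cafÃ©"))
```

Note that the helper treats the string as a whole: a line that mixes damaged and intact non-ASCII characters is returned unchanged, so the manual replacements may still be needed in that case.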

Solution 2

For Python 3.6:

your_str = your_str.encode('utf-8').decode('latin-1')
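
To see which direction the one-liner above works in, here is a small round trip, offered as an illustration rather than a correction of the answer. Encoding as UTF-8 and then decoding as Latin-1 turns each non-ASCII character into a two-character sequence; applying the codecs in the opposite order turns it back. The variable names are illustrative.

```python
original = "España"

# The order used in Solution 2: UTF-8 bytes reinterpreted as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

# The opposite order: recover the bytes, then decode them as UTF-8.
restored = garbled.encode("latin-1").decode("utf-8")

print(garbled)   # EspaÃ±a
print(restored)  # España
```

Which of the two orders is the right one depends on whether the string in hand is already garbled or still intact.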

Author by

I.P.

Updated on July 20, 2022

Comments

I.P. almost 2 years

My code looks like the following:

import codecs
import glob
import os

# dir is the directory that holds the Latin-1 input files
for file in glob.iglob(os.path.join(dir, '*.txt')):
    print(file)
    with codecs.open(file,encoding='latin-1') as f:
        infile = f.read()

with codecs.open('test.txt',mode='w',encoding='utf-8') as f:
    f.write(infile)

The files I work with are encoded in Latin-1 (obviously, I could not open them as UTF-8), but I want to write the resulting files in UTF-8.

But this:

<Trans audio_filename="VALE_M11_070.MP3" xml:lang="español">
<Datos clave_texto=" VALE_M11_070" tipo_texto="entrevista_semidirigida">
<Corpus corpus="PRESEEA" subcorpus="ESESUMA" ciudad="Valencia" pais="España"/>

Instead becomes this (in gedit):

<Trans audio_filename="VALE_M11_070.MP3" xml:lang="espa뇃漀氀∀㸀ഀ਀㰀䐀愀琀`漀猀 挀氀愀瘀攀开琀攀砀琀漀㴀∀ 嘀䄀䰀䔀开䴀㄀㄀开　㜀

If I print it in the terminal, it shows up normally.

Even more confusing is what I get when I open the resulting file with LibreOffice Writer:

<#T#r#a#n#s# (and so on)

So how do I properly convert a Latin-1 string to a UTF-8 string? In Python 2 it's easy, but in Python 3 it seems confusing to me.

I have already tried these in different combinations:

#infile = bytes(infile,'utf-8').decode('utf-8')
#infile = infile.encode('utf-8').decode('utf-8')
#infile = bytes(infile,'utf-8').decode('utf-8')

But somehow I always end up with the same weird output.
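
One possible reading of the symptoms above, stated as an assumption rather than a confirmed diagnosis: output that shows up as CJK characters in gedit and as "<#T#r#a#n#s#" in LibreOffice is what one would expect if the input files contain a NUL byte after each character, as UTF-16 files do. Decoding such bytes as Latin-1 never fails, but it yields a string with an embedded "\x00" after every character. The sketch below inspects the raw bytes for a byte order mark or a high share of NUL bytes before picking a codec. The function name guess_encoding and the 0.2 threshold are hypothetical choices, and the heuristic cannot tell Latin-1 apart from other single-byte encodings.

```python
import codecs


def guess_encoding(raw, default="latin-1"):
    """Return a codec name for raw bytes, preferring UTF-16 when they look like it."""
    # A byte order mark is the clearest sign of UTF-16.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Without a BOM, mostly-ASCII UTF-16 text still has a NUL in every other byte.
    if raw and raw.count(b"\x00") / len(raw) > 0.2:
        # NULs at odd offsets suggest little-endian, at even offsets big-endian.
        odd_nuls = raw[1::2].count(b"\x00")
        even_nuls = raw[0::2].count(b"\x00")
        return "utf-16-le" if odd_nuls >= even_nuls else "utf-16-be"
    return default


sample = '<Trans xml:lang="español">'
for codec in ("latin-1", "utf-16", "utf-16-le"):
    raw = sample.encode(codec)
    print(codec, "->", guess_encoding(raw))
```

To apply it, read a file in binary mode (open(path, "rb")), pass the bytes to guess_encoding, and decode with the returned codec before writing the text out as UTF-8.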

Thanks in advance!

Edit: This question is different to the questions linked in the comment, as it concerns Python 3, not Python 2.7.
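
If the input really is Latin-1, a conversion along the lines the question describes can be sketched with the built-in open(), which accepts an encoding argument in Python 3. Unlike the loop in the question, this version writes one output file per input file instead of a single test.txt. The function name latin1_to_utf8, its arguments, and the choice to write into a separate output directory are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def latin1_to_utf8(src_dir, dst_dir):
    """Re-encode every *.txt file in src_dir from Latin-1 to UTF-8 inside dst_dir."""
    out_dir = Path(dst_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).glob("*.txt")):
        # newline="" passes line endings through unchanged on read and on write.
        with open(src, encoding="latin-1", newline="") as f:
            text = f.read()
        dst = out_dir / src.name
        with open(dst, "w", encoding="utf-8", newline="") as f:
            f.write(text)
        written.append(dst)
    return written


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "in")
    src.mkdir()
    (src / "a.txt").write_bytes('pais="España"\r\n'.encode("latin-1"))
    out = latin1_to_utf8(src, Path(tmp, "out"))
    converted = out[0].read_bytes()

print(converted)
```

The string held in memory between the read and the write is plain Unicode text; the encoding only matters at the two file boundaries, which is why no explicit encode() or decode() call is needed.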
