Convert from ANSI to UTF-8

22,147

Solution 1

Why don't you read the file and write it as UTF-8? You can do that in Python.

# codecs.open lets you pass an explicit encoding
import codecs

# read input file; the encoding given here must be the one the file is actually stored in
with codecs.open(path, 'r', encoding='utf8') as file:
    lines = file.read()

# write output file as UTF-8
with codecs.open(path, 'w', encoding='utf8') as file:
    file.write(lines)
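
Note that the read encoding has to match how the file is actually stored. Here is a minimal sketch, assuming that "ANSI" means Windows-1252 (cp1252) and that you are on Python 3, where the built-in open accepts an encoding argument. The helper name convert_file and the demo file names are made up for illustration.

```python
import os
import tempfile


def convert_file(src_path, dst_path, src_encoding="cp1252"):
    """Re-encode one text file to UTF-8 (hypothetical helper)."""
    # Decode with the encoding the file is actually stored in (assumed cp1252).
    # newline="" passes line endings through unchanged.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    # Encode the decoded text as UTF-8.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Small demo on a throwaway file containing cp1252 bytes.
tmp_dir = tempfile.mkdtemp()
ansi_path = os.path.join(tmp_dir, "ansi.txt")
utf8_path = os.path.join(tmp_dir, "utf8.txt")
with open(ansi_path, "wb") as f:
    f.write(b"caf\xe9 \x92quoted\x92\r\n")

convert_file(ansi_path, utf8_path)

with open(utf8_path, "rb") as f:
    result = f.read()
```

If your files use a different code page, pass it as src_encoding instead of relying on the default.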

Solution 2

I appreciate that this is an old question, but having just resolved a similar problem, I thought I would share my solution.

I had a file, prepared by another program, that I needed to import into an sqlite3 database, but the text file was always 'ANSI' and sqlite3 requires UTF-8.

On Windows, Python recognises the ANSI encoding as 'mbcs', so the code I used, adapted from something else I found, is:

import codecs

# copy in 1 MiB blocks so large files are never held in memory whole
blockSize = 1048576
with codecs.open("your ANSI source file.txt", "r", encoding="mbcs") as sourceFile:
    with codecs.open("Your UTF-8 output file.txt", "w", encoding="UTF-8") as targetFile:
        while True:
            contents = sourceFile.read(blockSize)
            if not contents:
                break
            targetFile.write(contents)

The link below contains some information on the encoding types that I found during my research:

https://docs.python.org/2.4/lib/standard-encodings.html
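
For a large number of files, the same block-wise copy can be wrapped in a loop over a directory tree. This is only a rough sketch under some assumptions: Python 3, files that end in .txt, a source encoding of cp1252 (swap in "mbcs" on Windows if that suits you better), and an output folder that lies outside the input folder. The name convert_tree and its parameters are invented for this example.

```python
from pathlib import Path


def convert_tree(src_root, dst_root, src_encoding="cp1252", block_size=1048576):
    """Convert every *.txt file under src_root to UTF-8, mirroring the
    folder layout under dst_root (hypothetical helper).

    Returns the number of converted files and a list of files that
    could not be decoded with src_encoding.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    converted, failed = 0, []
    for src in src_root.rglob("*.txt"):
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        try:
            with open(src, "r", encoding=src_encoding, newline="") as fin, \
                 open(dst, "w", encoding="utf-8", newline="") as fout:
                # Copy block by block, as in the single-file version above.
                while True:
                    chunk = fin.read(block_size)
                    if not chunk:
                        break
                    fout.write(chunk)
            converted += 1
        except UnicodeDecodeError:
            # The file is not valid in src_encoding: drop the partial
            # output and remember the file for manual inspection.
            dst.unlink()
            failed.append(src)
    return converted, failed
```

Writing to a separate output folder keeps the originals untouched, so a wrong guess about the source encoding can be corrected and the run repeated.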

Author by

Karan Pappala

Updated on July 12, 2022

Comments

  • Karan Pappala almost 2 years

    I have around 600,000 files encoded in ANSI and I want to convert them to UTF-8. I can do that individually in Notepad++, but I can't do that for 600,000 files. Can I do this in R or Python?

    I found this link, but the Python script does not run: notepad++ converting ansi encoded file to utf-8

    • Klaus D. almost 9 years
      By "ANSI", do you mean the "Windows-1252" encoding?
    • Karan Pappala almost 9 years
      @KlausD. It shows as ANSI in Notepad++
    • Roland almost 9 years
      Please note that "lakh" is not a word from standard (US/UK == international) English. Many people outside your corner of the world don't know what a "lakh" is.
    • hd1 almost 9 years
      ANSI is not an encoding. And what is "lakh"?
    • Roland almost 9 years
      @hd1 "lakh" is a South Asian word for 100k.
    • Marichyasana almost 9 years
      You can try using the R function iconv()
  • Tensibai almost 9 years
    Won't reading a file in a different codepage as UTF-8 lose some characters? (I thought you had to read with the correct codepage before writing in another one.)
  • 3Ducker almost 9 years
    From the Python specs: "Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing."
  • ConductedForce about 3 years
    I had to change the read encoding to 'cp1252' to get it to work for me. Otherwise the file is still read as UTF-8, which gave me an error when it hit a mixed file: 'utf-8' codec can't decode byte 0x92
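
    For mixed sets of files like that, one option is to try strict UTF-8 first and only fall back to cp1252 when decoding fails. This is a sketch of that idea, not a reliable encoding detector; the name decode_with_fallback is made up, and cp1252 as the fallback is an assumption. A few byte values (0x81, for example) are undefined in Python's cp1252 codec and will still raise an error.

```python
def decode_with_fallback(raw, fallback="cp1252"):
    """Decode bytes as UTF-8 if they are valid UTF-8, else as `fallback`
    (hypothetical helper; the fallback encoding is an assumption)."""
    try:
        # Strict decoding: any invalid byte sequence raises UnicodeDecodeError.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Byte 0x92 on its own is not valid UTF-8, but in cp1252 it is the
        # right single quotation mark (U+2019).
        return raw.decode(fallback)


already_utf8 = decode_with_fallback("caf\u00e9".encode("utf-8"))
legacy_ansi = decode_with_fallback(b"it\x92s")
```

    Files that are already UTF-8 pass through unchanged, so running this over a folder that has been partly converted should not double-encode anything.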