Convert from ANSI to UTF-8

python r notepad++

22,147

Solution 1

Why don't you read the file and write it as UTF-8? You can do that in Python.

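# codecs.open() decodes on read and encodes on write
import codecs

# placeholder: replace with the file you want to convert
path = 'input.txt'

# read the input file using its actual encoding
# ('ANSI' in Notepad++ on a Western European Windows system is usually cp1252)
with codecs.open(path, 'r', encoding='cp1252') as file:
    lines = file.read()

# write the output file as UTF-8 (this overwrites the file in place)
with codecs.open(path, 'w', encoding='utf8') as file:
    file.write(lines)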
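
Since the question is about hundreds of thousands of files, the same read-then-write idea can be applied across a whole directory tree. The following is a minimal sketch, not a tested production script. It assumes the sources are Windows-1252 (`cp1252`) text files matching `*.txt`, and that the output goes to a separate directory outside the source tree so the originals are left untouched. The names `convert_file` and `convert_tree` are made up for illustration.

```python
from pathlib import Path


def convert_file(src, dst, src_encoding="cp1252"):
    """Decode src with src_encoding and write the text to dst as UTF-8."""
    # newline="" turns off newline translation, so CRLF/LF endings are kept as-is
    with open(src, "r", encoding=src_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(text)


def convert_tree(src_dir, dst_dir, pattern="*.txt", src_encoding="cp1252"):
    """Convert every file matching pattern under src_dir, mirroring the layout in dst_dir.

    Returns the number of files converted. dst_dir is assumed to lie outside src_dir.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    converted = 0
    # Materialise the listing first so files created during the run are not revisited
    for src in sorted(src_dir.rglob(pattern)):
        if not src.is_file():
            continue
        dst = dst_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        convert_file(src, dst, src_encoding)
        converted += 1
    return converted
```

A call such as `convert_tree("ansi_files", "utf8_files")` (both directory names are placeholders) would then write a UTF-8 copy of each matching file. A byte that is undefined in the chosen source encoding raises `UnicodeDecodeError`, which is usually a sign that the file is not in that encoding.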

Solution 2

I appreciate that this is an old question, but having just resolved a similar problem, I thought I would share my solution.

I had a file prepared by one program that I needed to import into an sqlite3 database, but the text file was always 'ANSI' and sqlite3 requires UTF-8.

On Windows, the ANSI encoding is recognised as 'mbcs' in Python, so the code I used, adapted from something else I found, is:

import codecs

# copy in 1 MiB blocks so large files are never loaded into memory at once
blockSize = 1048576
with codecs.open("your ANSI source file.txt","r",encoding="mbcs") as sourceFile:
    with codecs.open("Your UTF-8 output file.txt","w",encoding="UTF-8") as targetFile:
        while True:
            contents = sourceFile.read(blockSize)
            if not contents:
                # an empty read means end of file
                break
            targetFile.write(contents)

The link below contains some information on the encoding types that I found during my research:

https://docs.python.org/2.4/lib/standard-encodings.html
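
For Python 3, the block-wise copy can also be written with the built-in `open()` and `shutil.copyfileobj()`. This is a sketch under assumptions rather than a drop-in replacement: `stream_convert` is a hypothetical name, and the default source encoding is set to `cp1252` because `mbcs` only exists on Windows. Pass `src_encoding="mbcs"` on Windows to use the system ANSI code page instead.

```python
import shutil


def stream_convert(src, dst, src_encoding="cp1252", block_size=1024 * 1024):
    """Re-encode src to UTF-8 at dst, copying block_size characters at a time."""
    # newline="" on both sides keeps the original line endings unchanged
    with open(src, "r", encoding=src_encoding, newline="") as source_file, \
         open(dst, "w", encoding="utf-8", newline="") as target_file:
        # copyfileobj loops over read(block_size) / write() until the source is exhausted
        shutil.copyfileobj(source_file, target_file, block_size)
```

Because the files are opened in text mode, decoding and encoding happen inside the file objects, and memory use stays bounded by the block size.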


Author by

Karan Pappala

Updated on July 12, 2022

Comments

Karan Pappala almost 2 years

I have around 600,000 files encoded in ANSI and I want to convert them to UTF-8. I can do that individually in Notepad++, but I can't do that for 600,000 files. Can I do this in R or Python?

I have found this link but the Python script is not running: notepad++ converting ansi encoded file to utf-8
- Klaus D. almost 9 years
  
  By "ANSI", do you mean the "Windows-1252" encoding?
- Karan Pappala almost 9 years
  
  @KlausD. It shows as ANSI in Notepad++
- Roland almost 9 years
  
  Please note that "lakh" is not a word from standard (US/UK == international) English. Many people outside your corner of the world don't know what a "lakh" is.
- hd1 almost 9 years
  
  ANSI is not an encoding. And what is "lakh"?
- Roland almost 9 years
  
  @hd1 Lakh is a South Asian word for 100,000.
- Marichyasana almost 9 years
  
  You can try using the R function iconv()
Tensibai almost 9 years

Won't reading a file in a different code page as UTF-8 lose some characters? (I had thought you have to read with the correct code page before writing in another one.)
3Ducker almost 9 years

From the Python docs for codecs.open(): Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.
ConductedForce about 3 years

I had to change the read encoding to 'cp1252' to get it to work for me. Otherwise it still opens the file as UTF-8, which gave me an error when encountering a mixed file: 'utf-8' codec can't decode byte 0x92
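
If a batch contains some files that are already UTF-8 and others that are cp1252, one possible approach is to try a strict UTF-8 decode first and fall back to cp1252 only when that fails. This is a heuristic sketch, not a reliable encoding detector: `decode_with_fallback` is a made-up helper, the candidate list is an assumption, and a cp1252 file can in rare cases also happen to be valid UTF-8.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first encoding that decodes raw without error."""
    for encoding in encodings:
        try:
            # bytes.decode() is strict by default, so invalid input raises immediately
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Byte 0x92 is a right single quotation mark in cp1252 but is not valid UTF-8 here
text, used = decode_with_fallback(b"it\x92s")
print(used)
```

The bytes would come from reading each file in binary mode (`open(path, "rb")`), and the returned text can then be written back out with `encoding="utf-8"`.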