Convert from ANSI to UTF-8
Solution 1
Why don't you read the file and write it as UTF-8? You can do that in Python.
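    # codecs provides encoding-aware file objects
    import codecs

    # read the input file; `path` is the file to convert
    # note: encoding must be the file's actual source encoding; for Windows
    # "ANSI" files that is typically 'cp1252', and decoding them as 'utf8' can fail
    with codecs.open(path, 'r', encoding='utf8') as file:
        lines = file.read()

    # write the output file as UTF-8
    with codecs.open(path, 'w', encoding='utf8') as file:
        file.write(lines)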
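The comments on this page point out that decoding an ANSI file as UTF-8 can fail. A variant that names the source encoding explicitly might look like the sketch below. It is an illustration only, not the answerer's code: the function name `convert_to_utf8`, the separate output path, and the `cp1252` default are assumptions, and the right source encoding depends on how the files were produced.

```python
def convert_to_utf8(src_path, dst_path, source_encoding="cp1252"):
    """Decode src_path with source_encoding and re-encode it as UTF-8.

    Hypothetical helper: the name and the cp1252 default are assumptions.
    Pass the encoding your files actually use.
    """
    # newline="" disables newline translation, so line endings are preserved
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()

    # write to a separate file so the original stays intact if something fails
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Writing to a separate output path, rather than overwriting the input, keeps the original available if the guessed encoding turns out to be wrong.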
Solution 2
I appreciate that this is an old question, but having just resolved a similar problem I thought I would share my solution.
I had a file prepared by one program that I needed to import into an sqlite3 database, but the text file was always 'ANSI' and sqlite3 requires UTF-8.
The ANSI encoding is recognised as 'mbcs' in Python (on Windows), so the code I used, adapted from something else I found, is:
    import codecs

    # copy the text in chunks so large files are never held in memory at once
    blockSize = 1048576
    with codecs.open("your ANSI source file.txt", "r", encoding="mbcs") as sourceFile:
        with codecs.open("Your UTF-8 output file.txt", "w", encoding="UTF-8") as targetFile:
            while True:
                contents = sourceFile.read(blockSize)
                if not contents:
                    # an empty read means end of file
                    break
                targetFile.write(contents)
The link below contains some information on the encoding types that I found during my research:
https://docs.python.org/2.4/lib/standard-encodings.html
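For a large collection of files, the same chunked copy could be applied to a whole directory tree. The sketch below is one possible way to do that, not part of the original answer: `convert_tree`, its parameters, and the `*.txt` pattern are hypothetical. Because the `mbcs` codec is only available on Windows, the sketch takes the source encoding as a parameter and defaults to `cp1252`, which is an assumption about the files.

```python
from pathlib import Path


def convert_tree(src_root, dst_root, source_encoding="cp1252",
                 pattern="*.txt", block_size=1048576):
    """Convert every file matching pattern under src_root to UTF-8.

    Hypothetical helper: output files mirror the input layout under dst_root.
    Keep dst_root outside src_root so converted files are not picked up again.
    Returns the number of files converted.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    converted = 0
    for src in src_root.rglob(pattern):
        if not src.is_file():
            continue
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # newline="" keeps the original line endings unchanged
        with open(src, "r", encoding=source_encoding, newline="") as fin, \
             open(dst, "w", encoding="utf-8", newline="") as fout:
            while True:
                chunk = fin.read(block_size)
                if not chunk:
                    break
                fout.write(chunk)
        converted += 1
    return converted
```

A call such as `convert_tree("ansi_files", "utf8_files")` (both directory names are placeholders) would leave the originals untouched and write the converted copies alongside them in a second tree.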
Karan Pappala
Updated on July 12, 2022
Comments
-
Karan Pappala almost 2 years
I have around 600,000 files encoded in ANSI and I want to convert them to UTF-8. I can do that individually in Notepad++, but I can't do that for 600,000 files. Can I do this in R or Python? I have found this link, but the Python script is not running: notepad++ converting ansi encoded file to utf-8
-
Klaus D. almost 9 years
By "ANSI", do you mean the "Windows-1252" encoding?
-
Karan Pappala almost 9 years
@KlausD. It shows as ANSI in Notepad++
-
Roland almost 9 years
Please note that "lakh" is not a word from standard (US/UK == international) English. Many people outside your corner of the world don't know what a "lakh" is.
-
hd1 almost 9 years
ANSI is not an encoding. And what is "lakh"?
-
Roland almost 9 years
@hd1 "lakh" is a South Asian word for 100,000.
-
Marichyasana almost 9 years
You can try using the R function iconv()
-
Tensibai almost 9 years
Won't reading a file in a different codepage as UTF-8 lose some characters? (I thought you have to read with the correct codepage before writing in another codepage.)
-
3Ducker almost 9 years
From Python specs: "Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing."
-
ConductedForce about 3 years
I had to change the read encoding to 'cp1252' to get it to work for me. Otherwise it still opens the file as UTF-8, which gave me an error on a mixed file:
'utf-8' codec can't decode byte 0x92
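The byte 0x92 is the right single quotation mark in Windows-1252, but it is not a valid start byte in UTF-8, which is why the strict UTF-8 decode fails. When a batch mixes files that are already UTF-8 with ANSI ones, one option is to try UTF-8 first and fall back to the legacy encoding. The sketch below only illustrates that idea: `decode_with_fallback` is a hypothetical helper, and `cp1252` as the fallback is an assumption about the files.

```python
def decode_with_fallback(data, fallback="cp1252"):
    """Try strict UTF-8 first; on failure decode with the fallback encoding.

    Hypothetical helper. Returns (text, encoding_used).
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # bytes such as 0x92 are invalid UTF-8 here but valid in cp1252
        return data.decode(fallback), fallback


# b"it\x92s" is "it's" with a curly apostrophe, encoded as Windows-1252
text, used = decode_with_fallback(b"it\x92s")
```

Trying UTF-8 first works reasonably well because text in a single-byte encoding that contains non-ASCII characters rarely happens to be valid UTF-8, but it is a heuristic and not a guarantee.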