How to convert a file to utf-8 in Python?
Solution 1
You can use the codecs module, like this:
import codecs

BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "your-source-encoding") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)
EDIT: added a BLOCKSIZE parameter to control the file chunk size.
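On Python 3 the codecs module isn't needed for this: the built-in open() accepts an encoding argument directly. A sketch of the same chunked copy (the function name reencode is an assumption, not from the answer):

```python
# Python 3 sketch: built-in open() decodes on read and encodes on write
BLOCKSIZE = 1048576  # characters of decoded text per read

def reencode(sourceFileName, targetFileName, source_encoding):
    with open(sourceFileName, "r", encoding=source_encoding) as sourceFile:
        with open(targetFileName, "w", encoding="utf-8") as targetFile:
            while True:
                contents = sourceFile.read(BLOCKSIZE)
                if not contents:
                    break
                targetFile.write(contents)
```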
Solution 2
This worked for me in a small test (note: this is Python 2 code, since it uses the unicode built-in):
sourceEncoding = "iso-8859-1"
targetEncoding = "utf-8"
source = open("source")
target = open("target", "w")
target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))
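On Python 3, where unicode() no longer exists, the same one-shot conversion might be written like this (a sketch; the function name and default encodings are assumptions):

```python
# Python 3 sketch of the same idea: decode on read, encode on write
def convert_one_shot(source_path, target_path,
                     source_encoding="iso-8859-1", target_encoding="utf-8"):
    # Reads the whole file into memory, so this suits small files only
    with open(source_path, "r", encoding=source_encoding) as source:
        with open(target_path, "w", encoding=target_encoding) as target:
            target.write(source.read())
```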
Solution 3
Thanks for the replies, it works! And since the source files are in mixed formats, I added a list of source formats to be tried in sequence (sourceFormats), and on UnicodeDecodeError I try the next format:
from __future__ import with_statement

import os
import sys
import codecs
from chardet.universaldetector import UniversalDetector

targetFormat = 'utf-8'
outputDir = 'converted'
detector = UniversalDetector()

def get_encoding_type(current_file):
    detector.reset()
    for line in file(current_file):  # Python 2 built-in; use open(current_file, 'rb') on Python 3
        detector.feed(line)
        if detector.done: break
    detector.close()
    return detector.result['encoding']

def convertFileBestGuess(fileName):
    sourceFormats = ['ascii', 'iso-8859-1']
    for format in sourceFormats:
        try:
            with codecs.open(fileName, 'rU', format) as sourceFile:
                writeConversion(sourceFile, fileName)
                print('Done.')
                return
        except UnicodeDecodeError:
            pass

def convertFileWithDetection(fileName):
    print("Converting '" + fileName + "'...")
    format = get_encoding_type(fileName)
    try:
        with codecs.open(fileName, 'rU', format) as sourceFile:
            writeConversion(sourceFile, fileName)
            print('Done.')
            return
    except UnicodeDecodeError:
        pass
    print("Error: failed to convert '" + fileName + "'.")

def writeConversion(sourceFile, fileName):  # fileName must be passed in; it is not in scope otherwise
    with codecs.open(outputDir + '/' + fileName, 'w', targetFormat) as targetFile:
        for line in sourceFile:
            targetFile.write(line)

# Off topic: get the file list and call convertFile on each file
# ...
(EDIT by Rudro Badhon: this incorporates the original approach of trying multiple formats until one doesn't raise an exception, as well as an alternate approach that uses chardet.universaldetector.)
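The elided file-listing step could look something like the helper below. This is a hypothetical sketch, not part of the original answer; convert_one stands in for convertFileWithDetection above:

```python
import os

def convert_directory(directory, convert_one):
    # Call convert_one(path) for every regular file in directory,
    # returning the paths that were processed, in sorted order
    converted = []
    for fileName in sorted(os.listdir(directory)):
        path = os.path.join(directory, fileName)
        if os.path.isfile(path):
            convert_one(path)
            converted.append(path)
    return converted
```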
Solution 4
An answer for an unknown source encoding type, based on @Sébastien RoccaSerra's answer (Python 3.6):
import os
from chardet import detect

# get file encoding type
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']

from_codec = get_encoding_type(srcfile)

# add try: except block for reliability
try:
    with open(srcfile, 'r', encoding=from_codec) as f, open(trgfile, 'w', encoding='utf-8') as e:
        text = f.read() # for small files; for big files, read in chunks
        e.write(text)
    os.remove(srcfile) # remove the file in the old encoding
    os.rename(trgfile, srcfile) # rename the new file to the original name
except UnicodeDecodeError:
    print('Decode Error')
except UnicodeEncodeError:
    print('Encode Error')
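One caveat about the remove-then-rename pair: it leaves a brief window in which srcfile does not exist at all. os.replace() performs the overwrite in a single atomic step on the same filesystem. A sketch (the function name is an assumption):

```python
import os

def convert_to_utf8(srcfile, trgfile, from_codec):
    # Re-encode srcfile (in from_codec) into trgfile as UTF-8, then
    # swap it into place. os.replace overwrites atomically on the same
    # filesystem, so there is no moment where srcfile is missing.
    with open(srcfile, 'r', encoding=from_codec) as f, \
         open(trgfile, 'w', encoding='utf-8') as e:
        e.write(f.read())
    os.replace(trgfile, srcfile)
```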
Solution 5
You can use this one-liner (assuming you want to convert from UTF-16 to UTF-8):
python -c "from pathlib import Path; path = Path('yourfile.txt'); path.write_text(path.read_text(encoding='utf16'), encoding='utf8')"
where yourfile.txt is the path to your $file. For this to work you need Python 3.4 or newer (which you almost certainly have by now). Below is a more readable version of the same code:
from pathlib import Path
path = Path("yourfile.txt")
path.write_text(path.read_text(encoding="utf16"), encoding="utf8")
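The same pathlib pattern extends to a whole directory tree. A sketch; the glob pattern and function name are assumptions:

```python
from pathlib import Path

def convert_tree(root, pattern="*.txt"):
    # Re-encode every file matching pattern under root
    # from UTF-16 to UTF-8, in place
    for path in Path(root).rglob(pattern):
        path.write_text(path.read_text(encoding="utf16"), encoding="utf8")
```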
Sébastien RoccaSerra
I've been a professional programmer since 2000. Depending on my tasks, I code in Java and JavaScript or C++ and Lua every day. I also enjoy coding in Lisp and Smalltalk every week or so, and automate tasks with Ruby or Python scripts. At the moment I'm reading Functional JavaScript and To Mock a Mockingbird.
Updated on July 05, 2022
Comments
- Sébastien RoccaSerra, almost 2 years: I need to convert a bunch of files to utf-8 in Python, and I have trouble with the "converting the file" part. I'd like to do the equivalent of:
  iconv -t utf-8 $file > converted/$file # this is shell code
  Thanks!
- Brian, over 15 years: read() will always read the whole file - you probably want .read(BLOCKSIZE), where BLOCKSIZE is some suitable amount to read/write at once.
- itsadok, over 15 years: For tough cases you can try to detect the encoding with the chardet module from feedparser.org, but in your case it's overkill.
- Arafangion, about 13 years: Even better would be to specify binary mode.
- Honghe.Wu, about 10 years: @Arafangion Why would binary mode be better? Thanks!
- Arafangion, about 10 years: @Honghe.Wu: On Windows, text mode is the default, and that means your line endings will be mangled by the operating system, something you don't want if you're unsure about the encoding on disk.
- The Bndr, about 9 years: @Arafangion What would the example look like if I wanted to specify binary mode? target = open("target", "wb") - are there more changes?
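A sketch addressing the binary-mode question above: on Python 3 the closest text-mode equivalent is newline='', which disables newline translation while still decoding and encoding, so the line endings on disk pass through unchanged. The function name is hypothetical:

```python
# Sketch: convert encodings without translating line endings.
# newline='' keeps \r\n / \n exactly as found on disk (Python 3).
def convert_preserving_newlines(src, dst, src_enc):
    with open(src, "r", encoding=src_enc, newline="") as f, \
         open(dst, "w", encoding="utf-8", newline="") as g:
        g.write(f.read())
```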
- physicalattraction, over 7 years: My Python 3.5 doesn't recognize the function file. Where does that come from?
- Sébastien RoccaSerra, over 7 years: Yes, this answer was posted 8 years ago, so it's a piece of old Python 2 code.
- Arthur Julião, over 7 years: It does not answer the question.
- beep_check, about 4 years: this worked great for converting from iso-8859-1 to utf-8!
- david, almost 3 years: Depending on your operating system this may change the line-break control characters. Great answer nevertheless, thank you; it needs more upvotes. Simple as that, and no need to care about managing resources, according to the documentation of Path.write_text: "Open the file in text mode, write to it, and close the file."
- fskoras, about 2 years: Instead of "line[:-1]" it would be better to use line.rstrip('\r\n'). This way, no matter what line ending you encounter, you will get correct results.
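The difference the last comment describes, sketched:

```python
# line[:-1] removes exactly one trailing character, which leaves a stray
# '\r' on Windows-style lines; rstrip('\r\n') handles every line ending.
lines = ["unix\n", "windows\r\n", "old-mac\r"]
stripped = [line.rstrip('\r\n') for line in lines]
# stripped == ['unix', 'windows', 'old-mac']
# compare: "windows\r\n"[:-1] == 'windows\r'
```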