How to convert a file to utf-8 in Python?

Solution 1

You can use the codecs module, like this:

import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "your-source-encoding") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

EDIT: added BLOCKSIZE parameter to control file chunk size.
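
In Python 3 the built-in open accepts an encoding argument, so the same chunked copy can be written without codecs. This is only a minimal sketch, assuming the source encoding is known in advance; the name convert_to_utf8 and its parameters are made up for illustration. Passing newline="" is an optional choice that leaves line endings untouched.

```python
BLOCKSIZE = 1048576  # characters per read in text mode; tune as needed


def convert_to_utf8(source_path, target_path, source_encoding, blocksize=BLOCKSIZE):
    """Copy source_path to target_path, re-encoding the text as UTF-8.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    # newline="" disables newline translation on both sides,
    # so "\r\n" and "\n" pass through unchanged.
    with open(source_path, "r", encoding=source_encoding, newline="") as source_file:
        with open(target_path, "w", encoding="utf-8", newline="") as target_file:
            while True:
                contents = source_file.read(blocksize)
                if not contents:
                    break
                target_file.write(contents)
```

It would be called as convert_to_utf8("in.txt", "out.txt", "iso-8859-1"), with the file names and encoding replaced by your own.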

Solution 2

This worked for me in a small test:

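# Python 2 only: unicode() does not exist in Python 3.
sourceEncoding = "iso-8859-1"
targetEncoding = "utf-8"
# Binary mode keeps the OS from translating line endings; "with" closes both files.
with open("source", "rb") as source, open("target", "wb") as target:
    target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))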
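
Python 3 has no unicode() built-in, but the same idea can be expressed by reading bytes, decoding them, and encoding the result. A minimal sketch, assuming the whole file fits in memory; reencode_file is a hypothetical name, and the default encodings simply mirror the ones used above.

```python
def reencode_file(source_path, target_path,
                  source_encoding="iso-8859-1", target_encoding="utf-8"):
    """Read the whole file as bytes, decode it, and write it re-encoded.

    Hypothetical helper for illustration; suitable for small files only.
    """
    # Binary mode on both ends: Python never touches the line endings.
    with open(source_path, "rb") as source:
        raw = source.read()
    text = raw.decode(source_encoding)  # bytes -> str
    with open(target_path, "wb") as target:
        target.write(text.encode(target_encoding))  # str -> bytes
```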

Solution 3

Thanks for the replies, it works!

And since the source files are in mixed formats, I added a list of source formats to be tried in sequence (sourceFormats), and on UnicodeDecodeError I try the next format:

from __future__ import with_statement

import os
import sys
import codecs
from chardet.universaldetector import UniversalDetector

targetFormat = 'utf-8'
outputDir = 'converted'
detector = UniversalDetector()

def get_encoding_type(current_file):
    detector.reset()
    for line in file(current_file):  # file() is Python 2 only; open(current_file, 'rb') works in Python 3
        detector.feed(line)
        if detector.done: break
    detector.close()
    return detector.result['encoding']

def convertFileBestGuess(fileName):
    sourceFormats = ['ascii', 'iso-8859-1']
    for format in sourceFormats:
        try:
            with codecs.open(fileName, 'rU', format) as sourceFile:
                writeConversion(sourceFile, fileName)
                print('Done.')
                return
        except UnicodeDecodeError:
            pass  # try the next candidate encoding

def convertFileWithDetection(fileName):
    print("Converting '" + fileName + "'...")
    format = get_encoding_type(fileName)
    try:
        with codecs.open(fileName, 'rU', format) as sourceFile:
            writeConversion(sourceFile, fileName)
            print('Done.')
            return
    except UnicodeDecodeError:
        pass

    print("Error: failed to convert '" + fileName + "'.")


def writeConversion(sourceFile, fileName):
    # fileName is passed in explicitly; outputDir must already exist.
    with codecs.open(outputDir + '/' + fileName, 'w', targetFormat) as targetFile:
        for line in sourceFile:
            targetFile.write(line)

# Off topic: get the file list and call convertFile on each file
# ...

(EDIT by Rudro Badhon: this incorporates the original "try multiple formats until you don't get an exception" approach as well as an alternate approach that uses chardet.universaldetector.)
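
For Python 3, the "try each format in sequence" idea can be sketched on raw bytes, without chardet. This is an illustration rather than a port of the code above: decode_with_fallback is a hypothetical name, and the default list just repeats the two encodings used in sourceFormats.

```python
def decode_with_fallback(raw, candidate_encodings=("ascii", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly.

    Hypothetical helper; raises ValueError if every candidate fails.
    """
    for encoding in candidate_encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate does not fit; try the next one
    raise ValueError("none of the candidate encodings could decode the data")
```

Order matters: iso-8859-1 maps every possible byte to a character, so it never raises UnicodeDecodeError and should come last in the list. Stricter encodings such as ascii or utf-8 belong in front of it.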

Solution 4

Answer for an unknown source encoding, based on @Sébastien RoccaSerra's answer (Python 3.6):

import os    
from chardet import detect

# get file encoding type
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']

from_codec = get_encoding_type(srcfile)

# add try: except block for reliability
try: 
    with open(srcfile, 'r', encoding=from_codec) as f, open(trgfile, 'w', encoding='utf-8') as e:
        text = f.read() # for small files, for big use chunks
        e.write(text)

    os.remove(srcfile) # remove old encoding file
    os.rename(trgfile, srcfile) # rename new encoding
except UnicodeDecodeError:
    print('Decode Error')
except UnicodeEncodeError:
    print('Encode Error')
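
For big files, the "use chunks" remark in the comment above could look roughly like the sketch below. It assumes the source encoding has already been determined (for example by get_encoding_type) and is passed in as from_codec. The name convert_in_place and the temporary-file approach are my own choices, not part of the answer.

```python
import os
import tempfile


def convert_in_place(path, from_codec, chunk_size=1024 * 1024):
    """Rewrite path as UTF-8, reading chunk_size characters at a time.

    Hypothetical helper: the original file is only replaced if conversion succeeds.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with open(fd, "w", encoding="utf-8", newline="") as dst, \
                open(path, "r", encoding=from_codec, newline="") as src:
            # iter() with a sentinel stops when read() returns an empty string.
            for chunk in iter(lambda: src.read(chunk_size), ""):
                dst.write(chunk)
        os.replace(tmp_path, path)  # swap the converted file into place
    except BaseException:
        os.remove(tmp_path)  # clean up and leave the original untouched
        raise
```

Unlike the os.remove / os.rename pair, os.replace overwrites the destination in one step, so there is no moment where the source file is missing.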

Solution 5

You can use this one-liner (assuming you want to convert from UTF-16 to UTF-8):

    python -c "from pathlib import Path; path = Path('yourfile.txt') ; path.write_text(path.read_text(encoding='utf16'), encoding='utf8')"

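Here yourfile.txt is the path to your file ($file in the question).

For this to work you need Python 3.5 or newer: pathlib arrived in 3.4, but Path.read_text and Path.write_text were added in 3.5 (nowadays you almost certainly have it).

Below is a more readable version of the code above: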

from pathlib import Path
path = Path("yourfile.txt")
path.write_text(path.read_text(encoding="utf16"), encoding="utf8")
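
To get closer to the shell loop in the question (writing each result into a converted/ directory), the same pathlib approach can be wrapped in a small function. This is a hedged sketch: convert_directory, its parameters, and the "*.txt" pattern are assumptions for illustration, and it uses read_bytes/write_bytes so that line endings are not translated.

```python
from pathlib import Path


def convert_directory(source_dir, output_dir, source_encoding, pattern="*.txt"):
    """Write a UTF-8 copy of every matching file in source_dir into output_dir.

    Hypothetical helper; returns the list of files it wrote.
    """
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for path in sorted(source_dir.glob(pattern)):
        if not path.is_file():
            continue  # skip directories that happen to match the pattern
        # Work on bytes so "\r\n" and "\n" are preserved exactly.
        text = path.read_bytes().decode(source_encoding)
        target = output_dir / path.name
        target.write_bytes(text.encode("utf8"))
        converted.append(target)
    return converted
```

For example, convert_directory(".", "converted", "utf16") would convert every .txt file in the current directory.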
Author by

Sébastien RoccaSerra

I've been a professional programmer since 2000. Depending on my tasks, I code in Java and JavaScript or C++ and Lua every day. I also enjoy coding in Lisp and Smalltalk every week or so, and automate tasks with Ruby or Python scripts. At the moment I'm reading Functional JavaScript and To Mock a Mockingbird.

Updated on July 05, 2022

Comments

  • Sébastien RoccaSerra
    Sébastien RoccaSerra almost 2 years

    I need to convert a bunch of files to utf-8 in Python, and I have trouble with the "converting the file" part.

    I'd like to do the equivalent of:

    iconv -t utf-8 $file > converted/$file # this is shell code
    

    Thanks!

  • Brian
    Brian over 15 years
    read() will always read the whole file - you probably want .read(BLOCKSIZE), where BLOCKSIZE is some suitable amount to read/write at once.
  • itsadok
    itsadok over 15 years
    For tough cases you can try to detect the encoding with the chardet module from feedparser.org, but in your case it's overkill.
  • Arafangion
    Arafangion about 13 years
    Even better would be to specify binary mode.
  • Honghe.Wu
    Honghe.Wu about 10 years
    @Arafangion Why would binary mode be better? Thanks!
  • Arafangion
    Arafangion about 10 years
    @Honghe.Wu: On Windows, text mode is the default, which means your line endings will be mangled by the operating system, something you don't want if you're unsure about the encoding on disk.
  • The Bndr
    The Bndr about 9 years
    @Arafangion What would the example look like if I wanted to specify binary mode? target = open("target", "wb"): are there any other changes?
  • physicalattraction
    physicalattraction over 7 years
    My Python 3.5 doesn't recognize the function file. Where does that come from?
  • Sébastien RoccaSerra
    Sébastien RoccaSerra over 7 years
    Yes, this answer was posted 8 years ago, so it's a piece of old Python 2 code.
  • Arthur Julião
    Arthur Julião over 7 years
    It does not answer the question.
  • beep_check
    beep_check about 4 years
    This worked great for converting from iso-8859-1 to utf-8!
  • Rafael-WO
    Rafael-WO almost 3 years
    When in Python 3: consider using open instead of codecs.open.
  • david
    david almost 3 years
    Depending on your operating system this may change the line break control characters. Great answer nevertheless, thank you. It needs more upvotes. Simple as that and no need to care about managing resources according to the documentation of Path.write_text: Open the file in text mode, write to it, and close the file.
  • fskoras
    fskoras about 2 years
    Instead of "line[:-1]" it would be better to use line.rstrip('\r\n'). This way, no matter what line ending you encounter, you will get correct results.