Unicode (UTF-8) reading and writing to files in Python

Solution 1

Rather than mess with the encode and decode methods I find it easier to specify the encoding when opening the file. The io module (added in Python 2.6) provides an io.open function, which has an encoding parameter.

Use the open function from the io module.

>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")

Calling f.read() then returns a decoded Unicode object:

>>> f.read()
u'Capit\xe1l\n\n'

Note that in Python 3, the io.open function is an alias for the built-in open function. The built-in open function only supports the encoding argument in Python 3, not Python 2.
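As a minimal sketch of the round trip described above (the scratch file path and variable names are made up for illustration), the following writes a Unicode string with io.open, reads it back, and then reopens the file in binary mode to show the bytes that were stored. It should behave the same on Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

text = u"Capit\xe1n\n"

# Text mode with an explicit encoding: write() takes Unicode and encodes it.
with io.open(path, mode="w", encoding="utf-8") as f:
    f.write(text)

# Reading with the same encoding decodes the bytes back to Unicode.
with io.open(path, mode="r", encoding="utf-8") as f:
    round_tripped = f.read()

# Binary mode bypasses decoding and shows the raw UTF-8 bytes on disk.
with io.open(path, mode="rb") as f:
    raw = f.read()
```

Here round_tripped equals the original string, while raw contains the two-byte sequence \xc3\xa1 in place of the single character \xe1.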

Edit: Previously this answer recommended the codecs module. The codecs module can cause problems when mixing read() and readline(), so this answer now recommends the io module instead.

Use the open function from the codecs module.

>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")

Calling f.read() then returns a decoded Unicode object:

>>> f.read()
u'Capit\xe1l\n\n'

If you know the encoding of a file, using the codecs module is going to be much less confusing.

See http://docs.python.org/library/codecs.html#codecs.open

Solution 2

Now all you need in Python 3 is open(filename, 'r', encoding='utf-8')

[Edit on 2016-02-10 for requested clarification]

Python 3 added the encoding parameter to its built-in open function. The following information about open comes from the documentation: https://docs.python.org/3/library/functions.html#open

open(file, mode='r', buffering=-1, 
      encoding=None, errors=None, newline=None, 
      closefd=True, opener=None)

Encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

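So by adding encoding='utf-8' as a parameter to the open function, all file reading and writing is done as UTF-8. (UTF-8 is also the default source-code encoding in Python 3, but as quoted above, the default for open is platform dependent, so it is safer to pass the encoding explicitly.)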
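To illustrate why the explicit encoding matters, here is a small Python 3 sketch (the file name is hypothetical). It writes a string as UTF-8, then reads it back once with the matching encoding and once with latin-1 to simulate a platform default that differs from the one used for writing.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "capitan.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("Capit\xe1n")

# Matching encoding: the two UTF-8 bytes decode to one character.
with open(path, "r", encoding="utf-8") as f:
    correct = f.read()

# Mismatched encoding: each UTF-8 byte is read as a separate character.
with open(path, "r", encoding="latin-1") as f:
    garbled = f.read()

print(ascii(correct))  # 'Capit\xe1n'
print(ascii(garbled))  # 'Capit\xc3\xa1n'
```

The second read does not raise an error; it silently returns the wrong text, which is why relying on the default encoding can be hard to debug.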

Solution 3

Actually, this worked for me for reading a file with UTF-8 encoding in Python 3.2:

import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
    print(line)

Solution 4

So, I've found a solution for what I'm looking for, which is:

print open('f2').read().decode('string-escape').decode("utf-8")

There are some unusual codecs that are useful here. This particular reading allows one to take the escaped UTF-8 byte representation that Python prints (for example Capit\xc3\xa1n), copy it into an ASCII file, and have it read back in as Unicode. Under the "string-escape" decode, the backslashes won't be doubled. Note that the string-escape codec exists only in Python 2.

This allows for the sort of round trip that I was imagining.
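For Python 3, where string-escape is not available, one possible equivalent is sketched below. It is an assumption on my part rather than part of the original solution: it uses the unicode_escape codec followed by a latin-1 re-encode, and it assumes the file contains only ASCII characters plus \xNN escapes. The file is created in the snippet to simulate typing the escapes into an editor.

```python
import os
import tempfile

# Hypothetical scratch file standing in for f2.
path = os.path.join(tempfile.mkdtemp(), "f2")

# The file holds a literal backslash, 'x', 'c', '3', ... and not the
# two raw UTF-8 bytes.
with open(path, "wb") as f:
    f.write(b"Capit\\xc3\\xa1n\n")

with open(path, "rb") as f:
    raw = f.read()

# Step 1: interpret the backslash escapes; each \xNN becomes code point NN.
escaped = raw.decode("unicode_escape")

# Step 2: map those code points back to single bytes, then decode as UTF-8.
text = escaped.encode("latin-1").decode("utf-8")

print(ascii(text))  # 'Capit\xe1n\n'
```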

Solution 5

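# -*- encoding: utf-8 -*-

# Convert a file of unknown encoding to UTF-8.
# Relies on the Unix `file` command to guess the source encoding.

import io
import subprocess

file_location = "jumper.sub"
# subprocess replaces the Python 2-only commands module; passing the
# arguments as a list avoids shell quoting problems in the file name.
file_encoding = subprocess.check_output(
    ['file', '-b', '--mime-encoding', file_location]).decode('ascii').strip()

# Decode with the detected encoding and re-encode as UTF-8, line by line.
# newline='' keeps the original line endings untouched.
with io.open(file_location, 'r', encoding=file_encoding, newline='') as file_stream:
    with io.open(file_location + "b", 'w', encoding='utf-8', newline='') as file_output:
        for line in file_stream:
            file_output.write(line)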
Author: Sakie

Updated on July 08, 2022

Comments

  • Sakie
    Sakie almost 2 years

    I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).

    # The string, which has an a-acute in it.
    ss = u'Capit\xe1n'
    ss8 = ss.encode('utf8')
    repr(ss), repr(ss8)
    

    ("u'Capit\xe1n'", "'Capit\xc3\xa1n'")

    print ss, ss8
    print >> open('f1','w'), ss8
    
    >>> file('f1').read()
    'Capit\xc3\xa1n\n'
    

    So I type in Capit\xc3\xa1n into my favorite editor, in file f2.

    Then:

    >>> open('f1').read()
    'Capit\xc3\xa1n\n'
    >>> open('f2').read()
    'Capit\\xc3\\xa1n\n'
    >>> open('f1').read().decode('utf8')
    u'Capit\xe1n\n'
    >>> open('f2').read().decode('utf8')
    u'Capit\\xc3\\xa1n\n'
    

    What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?

    What I'm truly failing to grok here, is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?

    >>> print simplejson.dumps(ss)
    '"Capit\u00e1n"'
    >>> print >> file('f3','w'), simplejson.dumps(ss)
    >>> simplejson.load(open('f3'))
    u'Capit\xe1n'
    
  • Sakie
    Sakie about 15 years
    I think there are some pieces missing here: the file f2 contains (hex): 0000000: 4361 7069 745c 7863 335c 7861 316e 0a Capit\xc3\xa1n. codecs.open('f2','rb', 'utf-8'), for example, reads them all in as separate chars (expected). Is there any way to write to a file in ASCII that would work?
  • Matt Connolly
    Matt Connolly about 13 years
    Works perfectly for writing files too: instead of open(file,'w'), do codecs.open(file,'w','utf-8'). Solved.
  • nay
    nay over 11 years
    Saved my day too. Thank you so much
  • try-catch-finally
    try-catch-finally about 11 years
    Does the codecs.open(...) method also fully conform to the with open(...): style, where the with takes care of closing the file after all is done? It seems to work anyway.
  • Eagle
    Eagle almost 11 years
    Good response, I've tested both solutions (codecs.open(file,"r","utf-8") and simply open(file,"r").read().decode("utf-8")) and both worked perfectly.
  • scubadivingfool
    scubadivingfool almost 11 years
    @try-catch-finally Yes. I use with codecs.open(...) as f: all the time.
  • Mike Girard
    Mike Girard almost 11 years
    I wish I could upvote this a hundred times. After agonizing for several days over encoding issues caused by a lot of mixed data and going cross-eyed reading about encoding, this answer is like water in a desert. Wish I'd seen it sooner.
  • abarisone
    abarisone about 8 years
    Could you please elaborate more your answer adding a little more description about the solution you provide?
  • Taylor D. Edmiston
    Taylor D. Edmiston over 7 years
    It looks like this is available in Python 2 using the codecs module - codecs.open('somefile', encoding='utf-8') stackoverflow.com/a/147756/149428
  • JinSnow
    JinSnow over 7 years
    I'm getting a "TypeError: expected str, bytes or os.PathLike object, not _io.TextIOWrapper" any idea why?
  • Jacquot
    Jacquot about 7 years
    I think, considering the number of upvotes, it would be a great idea to accept the second answer :)
  • scubadivingfool
    scubadivingfool over 6 years
    Thanks for the tip @personal_cloud I'll update the answer.
  • Evan Hu
    Evan Hu over 6 years
    Yes, using io is better; But I wrote the with statement like this with io.open('data.txt', 'w', 'utf-8') as file: and got an error: TypeError: an integer is required. After I changed to with io.open('data.txt', 'w', encoding='utf-8') as file: and it worked.
  • Pat Grady
    Pat Grady about 6 years
    Great catch! I was trying to clean up code downstream; I went straight to the source of the problem with io.open(filename,'r',encoding='utf-8') as file:
  • Perry
    Perry over 4 years
    Use encoding="utf-8-sig" if there's any chance your file will have a BOM (works in Python 2.7)
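The BOM tip in the last comment can be sketched as follows. The file name is hypothetical, and the BOM is written by hand to simulate a file saved by an editor that adds one. Reading with plain utf-8 leaves the BOM in the text as a leading U+FEFF character, while utf-8-sig strips it.

```python
import codecs
import io
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "with_bom.txt")

# Write the UTF-8 byte order mark followed by UTF-8 encoded text.
with io.open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"Capit\xe1n".encode("utf-8"))

# Plain utf-8 keeps the BOM as the first character of the result.
with io.open(path, "r", encoding="utf-8") as f:
    plain = f.read()

# utf-8-sig removes a leading BOM if one is present.
with io.open(path, "r", encoding="utf-8-sig") as f:
    stripped = f.read()
```

Here plain starts with u'\ufeff' and stripped does not. utf-8-sig also reads files without a BOM normally, so it is a reasonable choice when the presence of a BOM is uncertain.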