How to open an ascii-encoded file as UTF8?

Solution 1

You are opening files without specifying an encoding, so in Python 2 you get byte strings back, and whenever Python has to convert them implicitly it uses the default encoding (ASCII).

You need to decode the byte string explicitly, using the .decode() method:

 template_str = template_str.decode('utf8')

The val value you tried to interpolate into your template is itself a unicode value, while the template read from the file is a byte string. To combine the two, Python 2 converts implicitly using the default encoding (ASCII), and that conversion fails on the non-ASCII character.
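As a rough illustration of the fix above, here is a minimal sketch that decodes a byte-string template before formatting it. The template bytes and the value of val are made up for the example, and the b/u literal prefixes are used only so the snippet behaves the same on Python 2 and Python 3.

```python
# Hypothetical template content, as the raw bytes a plain file read returns in Python 2.
template_bytes = b"Product: {attrib}"

# Decode explicitly so the template becomes a unicode (text) value.
template_str = template_bytes.decode('utf8')

# val contains a non-ASCII character (the registered sign, U+00AE).
val = u"Acme\xae"

# Both operands are text now, so no implicit ASCII conversion is attempted.
result = template_str.format(attrib=val)
```

The point of the sketch is only that the conversion happens once, explicitly, at the boundary where bytes enter the program.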

Did I mention already you should read Joel Spolsky's article on Unicode and the Python Unicode HOWTO? They'll help you understand what happened here.

Solution 2

A solution that works in Python 2:

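import codecs

# Python 2: codecs.open decodes while reading, so read() returns unicode.
with codecs.open('filename.txt', 'r', 'ascii') as fo:
    content = fo.read()
assert isinstance(content, unicode)

# Encode the text back into a UTF-8 byte string (str in Python 2).
utf8_content = content.encode('utf-8')
assert isinstance(utf8_content, str)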
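For comparison, here is a sketch of the same idea using io.open, which is available in Python 2.6+ and Python 3 and accepts an encoding argument. The read_text helper, the temporary file, and the template content are hypothetical names chosen for the demo. UTF-8 is assumed as the encoding because pure-ASCII files are also valid UTF-8.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Return the file content as decoded text (unicode in Python 2, str in Python 3)."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Create a small throwaway file so the example is self-contained.
fd, demo_path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as raw:
    raw.write(b"<p>Hello {attrib}</p>")

# The template is already text, so formatting with a non-ASCII value works.
template = read_text(demo_path)
rendered = template.format(attrib=u"\xae")

os.remove(demo_path)
```

Because the decoding happens inside read_text, the rest of the code only ever sees text values.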

Solution 3

I suppose that you are sure that your files are encoded in ASCII. Are you? :) Because ASCII is a subset of UTF-8, you can decode such data as UTF-8 without problems. However, if you are sure that the data is pure ASCII, decode it as ASCII rather than UTF-8, so that any unexpected non-ASCII byte raises an error instead of passing silently.
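A small sketch of that relationship, using made-up sample data: ASCII bytes decode to the same text under both codecs, while bytes that contain an encoded non-ASCII character are rejected by the strict ASCII codec.

```python
# Pure ASCII bytes: both codecs produce the same text.
ascii_bytes = b"plain ASCII text"
same_result = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# Bytes containing a UTF-8 encoded registered sign (U+00AE).
utf8_bytes = u"Registered \xae".encode("utf-8")

# The ASCII codec refuses any byte outside the range 0-127.
try:
    utf8_bytes.decode("ascii")
    strict_ascii_ok = True
except UnicodeDecodeError:
    strict_ascii_ok = False

# The UTF-8 codec decodes the same bytes without complaint.
decoded = utf8_bytes.decode("utf-8")
```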

"How do I get it to load as UTF8?"

I believe you mean "How do I get it to load as unicode?". Just decode the data using the ASCII codec and, in Python 2.x, the resulting data will be of type unicode. In Python 3, the resulting data will be of type str.

You will have to read about this topic in order to learn how to perform this kind of decoding in Python. Once understood, it is very simple.

Author: Jesvin Jose

Updated on July 07, 2020

Comments

  • Jesvin Jose
    Jesvin Jose almost 4 years

    My files are in US-ASCII, and a command like a = file('main.html') followed by a.read() loads them as ASCII text. How do I get it to load as UTF8?

    The problem I am trying to solve is:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 38: ordinal not in range(128)

    I was using the content of the files for templating, as in template_str.format(attrib=val). But the string to interpolate contains characters from a superset of ASCII.

    Our team's version control and text editors do not care about the encoding. So how do I handle it in the code?

  • Jesvin Jose
    Jesvin Jose over 11 years
    Yes, file -bi returns charset=us-ascii for encoding.
  • Dr. Jan-Philip Gehrcke
    Dr. Jan-Philip Gehrcke over 11 years
    Now that you added more information, you see that your files actually are not ASCII-encoded.
  • Jesvin Jose
    Jesvin Jose over 11 years
    Turns out a common ® (learned it after it ran correctly) can crash the code. I was trying to treat it like a bug as it was 20:00 here and I was annoyed. I will read those and recommend them to the whole team. I owe you!