Writing Unicode text to a text file?

python unicode character-encoding python-2.x

349,626

Solution 1

Deal exclusively with unicode objects as much as possible by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.

If your string is actually a unicode object, you'll need to encode it to a byte string (a str object in Python 2), for example as UTF-8, before writing it to a file:

foo = u'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()

When you read that file again, you'll get a UTF-8-encoded byte string that you can decode to a unicode object:

f = file('test', 'r')
print f.read().decode('utf8')
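The snippets above are Python 2 code. On Python 3 a text-mode file rejects bytes ("TypeError: write() argument must be str, not bytes"), so the same encode-on-the-way-out idea needs the file opened in binary mode. A minimal sketch, assuming a scratch path named `test_path` (a hypothetical name, not part of the original answer):

```python
import os
import tempfile

foo = u'Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말.'

# Hypothetical scratch location; any writable path would do.
test_path = os.path.join(tempfile.mkdtemp(), 'test')

# Binary mode ('wb'): the file receives bytes that we encoded ourselves.
with open(test_path, 'wb') as f:
    f.write(foo.encode('utf8'))

# Binary mode ('rb'): read the raw bytes back, then decode them to text.
with open(test_path, 'rb') as f:
    round_tripped = f.read().decode('utf8')

print(round_tripped == foo)
```

Binary mode also behaves the same way on Python 2, so this form should run on both.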

Solution 2

In Python 2.6+, you could use io.open(), which is the same function as the builtin open() on Python 3:

import io

with io.open(filename, 'w', encoding=character_encoding) as file:
    file.write(unicode_text)

This might be more convenient if you need to write the text incrementally (you don't need to call unicode_text.encode(character_encoding) multiple times). Unlike the codecs module, the io module has proper universal newlines support.
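To illustrate the incremental case, here is a small sketch that writes several pieces of text through one file object and reads them back. The file name and the `chunks` list are made up for the example:

```python
import io
import os
import tempfile

# Hypothetical path and sample data, chosen only for illustration.
filename = os.path.join(tempfile.mkdtemp(), 'incremental.txt')
character_encoding = 'utf-8'
chunks = [u'Δ first line\n', u'あ second line\n', u'말 third line\n']

# Each write() takes text; the file object encodes it on the way out,
# so there is no per-chunk .encode() call.
with io.open(filename, 'w', encoding=character_encoding) as out:
    for chunk in chunks:
        out.write(chunk)

# Reading in text mode decodes the bytes and normalizes newlines to '\n'.
with io.open(filename, 'r', encoding=character_encoding) as inp:
    text = inp.read()

print(text == u''.join(chunks))
```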

Solution 3

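Unicode string handling is already standardized in Python 3.

Strings are already Unicode in memory: a str is a sequence of Unicode code points (since Python 3.3 the interpreter stores each string using 1, 2, or 4 bytes per code point, whichever is the narrowest that fits).

You only need to open the file with encoding="utf-8"
(the conversion from in-memory code points to variable-byte-length UTF-8 is performed automatically when writing from memory to file.)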

out1 = "(嘉南大圳 ㄐㄧㄚ　ㄋㄢˊ　ㄉㄚˋ　ㄗㄨㄣˋ )"
fobj = open("t1.txt", "w", encoding="utf-8")
fobj.write(out1)
fobj.close()
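The same write can be done with a with block, which closes the file even if an error occurs. The sketch below (Python 3 only) also compares the number of characters with the number of bytes on disk, to show the variable-length conversion at work. The shortened string and the temporary path are assumptions made for the example:

```python
import os
import tempfile

out1 = "嘉南大圳"  # four characters, each needing three bytes in UTF-8
path = os.path.join(tempfile.mkdtemp(), "t1.txt")  # hypothetical path

with open(path, "w", encoding="utf-8") as fobj:
    fobj.write(out1)

char_count = len(out1)
size_on_disk = os.path.getsize(path)
print(char_count, size_on_disk)  # 4 12
```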

Solution 4

Preface: will your viewer work?

Make sure your viewer/editor/terminal (however you are interacting with your utf-8 encoded file) can read the file. This is frequently an issue on Windows, for example with Notepad.
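To see what a viewer with the wrong encoding setting does to a correct file, the sketch below decodes valid UTF-8 bytes with a different codec. The choice of `mac_roman` is an assumption made for illustration; it happens to reproduce the `Qur‚Äôan` garbling reported in the question, but other viewers will guess other legacy encodings and show different symbols:

```python
# The apostrophe here is U+2019 RIGHT SINGLE QUOTATION MARK.
text = u'Qur\u2019an'
utf8_bytes = text.encode('utf-8')

# A viewer that assumes a legacy encoding reads the same bytes differently.
# 'mac_roman' is an illustrative guess, not a diagnosis.
misread = utf8_bytes.decode('mac_roman')
correct = utf8_bytes.decode('utf-8')

print(misread)  # Qur‚Äôan
print(correct)  # Qur’an
```

The bytes in the file are the same in both cases; only the interpretation differs, which is why the fix belongs in the viewer's settings rather than in the writing code.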

Writing Unicode text to a text file?

In Python 2, use open from the io module (this is the same as the builtin open in Python 3):

import io

As a general best practice, use UTF-8 for writing to files (we don't even have to worry about byte order with UTF-8).

encoding = 'utf-8'

utf-8 is the most modern and universally usable encoding - it works in all web browsers, most text-editors (see your settings if you have issues) and most terminals/shells.

On Windows, you might try utf-16le if you're limited to viewing output in Notepad (or another limited viewer).

encoding = 'utf-16le' # sorry, Windows users... :(

And just open it with the context manager and write your unicode characters out:

with io.open(filename, 'w', encoding=encoding) as f:
    f.write(unicode_object)
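As a rough illustration of how the two encodings mentioned above differ on disk, the following sketch writes the same single character with each and reads the raw bytes back. The directory, file names, and sample character are assumptions for the example:

```python
import io
import os
import tempfile

unicode_object = u'\xe9'  # LATIN SMALL LETTER E WITH ACUTE
out_dir = tempfile.mkdtemp()  # hypothetical scratch directory
raw_bytes = {}

for encoding in ('utf-8', 'utf-16le'):
    filename = os.path.join(out_dir, 'sample-' + encoding + '.txt')
    # Text mode: the file object encodes the text for us.
    with io.open(filename, 'w', encoding=encoding) as f:
        f.write(unicode_object)
    # Binary mode: look at exactly what ended up on disk.
    with io.open(filename, 'rb') as f:
        raw_bytes[encoding] = f.read()

print(raw_bytes['utf-8'])     # b'\xc3\xa9'
print(raw_bytes['utf-16le'])  # b'\xe9\x00'
```

The UTF-16 form depends on byte order (the "le" suffix means little-endian), while the UTF-8 byte sequence is the same on every platform.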

Example using many Unicode characters

Here's an example that attempts to map every possible code point up to three bytes wide (four is the max, but that would be going a bit far) from its integer value to an encoded printable output, along with its name, if possible (put this into a file called uni.py):

from __future__ import print_function
import io
from unicodedata import name, category
from curses.ascii import controlnames
from collections import Counter

try: # use these if Python 2
    unicode_chr, range = unichr, xrange
except NameError: # Python 3
    unicode_chr = chr

exclude_categories = set(('Co', 'Cn'))
counts = Counter()
control_names = dict(enumerate(controlnames))
with io.open('unidata', 'w', encoding='utf-8') as f:
    for x in range((2**8)**3): 
        try:
            char = unicode_chr(x)
        except ValueError:
            continue # can't map to unicode, try next x
        cat = category(char)
        counts.update((cat,))
        if cat in exclude_categories:
            continue # get rid of noise & greatly shorten result file
        try:
            uname = name(char)
        except ValueError: # probably control character, don't use actual
            uname = control_names.get(x, '')
            f.write(u'{0:>6x} {1}    {2}\n'.format(x, cat, uname))
        else:
            f.write(u'{0:>6x} {1}  {2}  {3}\n'.format(x, cat, char, uname))
# may as well describe the types we logged.
for cat, count in counts.items():
    print('{0} chars of category, {1}'.format(count, cat))

This should take on the order of a minute to run. You can then view the data file, and if your file viewer can display Unicode, you'll see the characters. Information about the categories can be found here. The Cn (unassigned) and Co (private use) categories have no symbols associated with them, so the script counts them but excludes them from the output file.

$ python uni.py

Each line of the output file holds the hexadecimal code point, the category, the symbol (omitted when no name can be found, which probably means a control character), and the name of the symbol.

I recommend less on Unix or Cygwin (don't print/cat the entire file to your output):

$ less unidata

This displays lines similar to the following, which I sampled using Python 2 (Unicode 5.2):

     0 Cc NUL
    20 Zs     SPACE
    21 Po  !  EXCLAMATION MARK
    b6 So  ¶  PILCROW SIGN
    d0 Lu  Ð  LATIN CAPITAL LETTER ETH
   e59 Nd  ๙  THAI DIGIT NINE
  2887 So  ⢇  BRAILLE PATTERN DOTS-1238
  bc13 Lo  밓  HANGUL SYLLABLE MIH
  ffeb Sm  ￫  HALFWIDTH RIGHTWARDS ARROW

My Python 3.5 from Anaconda has Unicode 8.0; the Unicode database version is tied to the Python release, so other 3.x versions may differ.
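To check which Unicode version a given interpreter ships with, the unicodedata module exposes it directly. A short sketch (the variable names are arbitrary):

```python
import unicodedata

# Version of the Unicode database bundled with this interpreter.
version = unicodedata.unidata_version
print(version)

# Name lookup for one of the sampled code points above (hex e59).
thai_nine = unicodedata.name(u'\u0e59')
print(thai_nine)  # THAI DIGIT NINE
```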

Solution 5

The file opened by codecs.open is a file that takes unicode data, encodes it in iso-8859-1 and writes it to the file. However, what you try to write isn't unicode; you take unicode and encode it in iso-8859-1 yourself. That's what the unicode.encode method does, and the result of encoding a unicode string is a bytestring (a str type.)

You should either use normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.
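The two options can be sketched side by side. The sample string and file names below are invented for the example; the point is only that each approach encodes exactly once and that both produce the same bytes:

```python
import codecs
import os
import tempfile

all_html = u'caf\xe9<br/>na\xefve'  # sample text, representable in iso-8859-1
out_dir = tempfile.mkdtemp()  # hypothetical scratch directory

# Option A: plain file in binary mode, encode the text yourself.
path_a = os.path.join(out_dir, 'a.txt')
with open(path_a, 'wb') as f:
    f.write(all_html.encode('iso-8859-1', 'replace'))

# Option B: codecs.open() encodes for you, so pass it text, not bytes.
path_b = os.path.join(out_dir, 'b.txt')
with codecs.open(path_b, mode='w', encoding='iso-8859-1') as f:
    f.write(all_html)

with open(path_a, 'rb') as f:
    bytes_a = f.read()
with open(path_b, 'rb') as f:
    bytes_b = f.read()

print(bytes_a == bytes_b)
```

Mixing the two, that is, passing already-encoded bytes to the codecs.open() file, is what triggers the error described in the question.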

simon

Updated on September 14, 2020

Comments

simon over 3 years
I'm pulling data out of a Google doc, processing it, and writing it to a file (that eventually I will paste into a Wordpress page).

It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source?

Currently I'm converting everything to Unicode on the way in, joining it all together in a Python string, then doing:
```
import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))
```
There is an encoding error on the last line:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 12286: ordinal not in range(128)

Partial solution:

This Python runs without an error:
```
row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))
```
But then if I open the actual text file, I see lots of symbols like:
```
Qur‚Äôan 
```
Maybe I need to write to something other than a text file?
Thomas K about 13 years

The program you're using to open it is not interpreting the UTF-8 text correctly. It should have an option to open the file as UTF-8.
simon about 13 years

Thanks. This runs without an error, but then if I open the text file, I see a bunch of weird symbols :) I need to copy and paste the text into a Wordpress page (don't ask). Is there any way I can actually print the symbols that are there? I guess not to a txt file, right, but maybe to something else?
quasistoic about 13 years

What are you using to open the text file? I'm guessing you're on Windows, and you're opening it in Notepad, which isn't too intelligent with encodings. What happens when you open it in Wordpad?
Richard Rast about 10 years

I was pretty excited about this answer, but it gives an error on my machine. When I copy/paste your code, I get an error: "TypeError: must be str, not bytes"
Liwen Zhao over 6 years

But this does not work on Python 2, right? (I should say, this Python 3 code looks so concise and reasonable.)
david m lee over 6 years

it should not work on Python 2. We stay on Python 3. 3 is so much better.
Georgy Gobozov about 6 years

Man, I spent so much time to find this! Thank you!
Hippo almost 6 years

This works for Python 3 too (obvious, but still worth pointing out).
Omar Cusma Fait over 4 years

@quasistoic where does the file method come from?
Benji about 4 years

I needed to turn binary mode on, i.e. f=open('test', 'wb'), as described in stackoverflow.com/a/5513856/6580199 - otherwise I would get "TypeError: write() argument must be str, not bytes"
Kerwin Sneijders over 3 years

This is THE answer. This is how you properly write utf-8 to a file, thanks!
Kerwin Sneijders about 3 years

This answer should probably include the open('filename', 'w', encoding='utf-8') from @david_n_lee's answer (For python 3)
bakalolo about 3 years

Doesn't work for me getting this error: TypeError: write() argument must be str, not bytes
Csaba Toth over 2 years

@KerwinSneijders the question is about Python 2.7, not Python 3
Kerwin Sneijders over 2 years

Python 2.x is no longer supported, more and more people will never use python 2 anymore and will find this question on SO when searching for a python 3 solution. And I don't think there should be 2 questions both for python 2 and 3, so because python 2.x is no longer supported, this should be the new accepted answer