Write to UTF-8 file in Python

python utf-8 character-encoding byte-order-mark

401,671

Solution 1

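I believe the problem is that codecs.BOM_UTF8 is a byte string, not a Unicode string. I suspect the file handler is trying to guess what you really mean based on "I'm meant to be writing Unicode as UTF-8-encoded text, but you've given me a byte string!" In Python 2 that guess takes the form of an implicit ASCII decode of the byte string before it is re-encoded as UTF-8, and the byte 0xEF is outside the ASCII range, hence the UnicodeDecodeError.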

Try writing the Unicode string for the byte order mark (i.e. Unicode U+FEFF) directly, so that the file just encodes that as UTF-8:

import codecs

file = codecs.open("lol", "w", "utf-8")
file.write(u'\ufeff')
file.close()

(That seems to give the right answer - a file with bytes EF BB BF.)

EDIT: S. Lott's suggestion of using "utf-8-sig" as the encoding is a better one than explicitly writing the BOM yourself, but I'll leave this answer here as it explains what was going wrong before.
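
For readers on Python 3, the same byte-string versus text mismatch can be reproduced with the built-in open. The following is only a minimal sketch, assuming Python 3; the scratch path and the variable names rejected and raw are made up for the demonstration and are not part of the original answer.

```python
import codecs
import os
import tempfile

# Hypothetical scratch path, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "bom_demo.txt")

with open(path, "w", encoding="utf-8") as f:
    try:
        # A text-mode file accepts str only, so the byte-string BOM is rejected.
        f.write(codecs.BOM_UTF8)
        rejected = False
    except TypeError:
        rejected = True
    # Writing the character U+FEFF lets the codec produce the BOM bytes itself.
    f.write("\ufeff")

# Read the file back in binary mode to inspect the bytes actually written.
with open(path, "rb") as f:
    raw = f.read()

print(rejected)
print(raw)
```

Under these assumptions the byte-string write is refused with a TypeError rather than a UnicodeDecodeError, and the file ends up containing only the three BOM bytes.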

Solution 2

Read the following: http://docs.python.org/library/codecs.html#module-encodings.utf_8_sig

Do this

import codecs

with codecs.open("test_output", "w", "utf-8-sig") as temp:
    temp.write("hi mom\n")
    temp.write(u"This has ♭")

The resulting file is UTF-8 with the expected BOM.
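
As a rough Python 3 counterpart (an assumption, since the answer above targets codecs.open), the built-in open accepts the same "utf-8-sig" encoding name. The sketch below writes to a hypothetical temporary path and then reads the file back three ways to show where the BOM appears.

```python
import os
import tempfile

# Hypothetical scratch path, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "test_output")

# "utf-8-sig" emits the BOM once, before the first piece of text.
with open(path, "w", encoding="utf-8-sig") as temp:
    temp.write("hi mom\n")
    temp.write("This has \u266d")  # U+266D MUSIC FLAT SIGN

# Raw bytes: the file starts with EF BB BF.
with open(path, "rb") as f:
    raw = f.read()

# Decoding with "utf-8-sig" strips the BOM again.
with open(path, encoding="utf-8-sig") as f:
    text_sig = f.read()

# Decoding with plain "utf-8" keeps it as a leading U+FEFF character.
with open(path, encoding="utf-8") as f:
    text_plain = f.read()

print(raw[:3])
print(repr(text_sig))
print(repr(text_plain[0]))
```

Reading with "utf-8-sig" is the safer choice when it is unknown whether a file carries a BOM, because the codec also accepts input that has none.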

Solution 3

It is very simple in Python 3: just use this. No library is needed.

with open('text.txt', 'w', encoding='utf-8') as f:
    f.write(text)
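
Note that plain "utf-8" does not write a byte order mark. The sketch below, assuming Python 3, compares the two encoding names; has_utf8_bom is a hypothetical helper and the file names are placeholders in a temporary directory.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Hypothetical helper: report whether a file starts with EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


# Placeholder files in a scratch directory.
workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "plain.txt")
signed_path = os.path.join(workdir, "signed.txt")

text = "This has \u266d\n"

# Plain UTF-8: the text is encoded as-is, with no BOM in front.
with open(plain_path, "w", encoding="utf-8") as f:
    f.write(text)

# UTF-8 with signature: the same text, preceded by the BOM.
with open(signed_path, "w", encoding="utf-8-sig") as f:
    f.write(text)

print(has_utf8_bom(plain_path))
print(has_utf8_bom(signed_path))
```

If the BOM is required, as in the original question, only the encoding name needs to change.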

Solution 4

@S-Lott gives the right procedure, but expanding on the Unicode issues, the Python interpreter can provide more insights.

Jon Skeet is right (unusual) about the codecs module - it contains byte strings:

>>> import codecs
>>> codecs.BOM
'\xff\xfe'
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
>>>

Picking another nit, the BOM has a standard Unicode name, and it can be entered as:

>>> bom= u"\N{ZERO WIDTH NO-BREAK SPACE}"
>>> bom
u'\ufeff'

It is also accessible via unicodedata:

>>> import unicodedata
>>> unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')
u'\ufeff'
>>>
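
The interpreter sessions above are from Python 2. As a hedged sketch of how the same relationships look in Python 3 (where the constants are bytes objects and the u prefix is optional), the following checks can be run as a script; the variable names are illustrative only.

```python
import codecs
import unicodedata

# In Python 3 the codecs constants are bytes, not text.
bom_bytes = codecs.BOM_UTF8

# The BOM as a one-character str, entered by its Unicode name.
bom_char = "\N{ZERO WIDTH NO-BREAK SPACE}"

# Encoding the character gives the constant; decoding the constant gives the character.
encoded = bom_char.encode("utf-8")
decoded = bom_bytes.decode("utf-8")

# unicodedata maps between the name and the character in both directions.
looked_up = unicodedata.lookup("ZERO WIDTH NO-BREAK SPACE")
name = unicodedata.name("\ufeff")

print(type(bom_bytes).__name__, encoded == bom_bytes, decoded == bom_char)
print(name)
```

codecs.BOM itself is the UTF-16 mark in the platform's native byte order, so its value depends on the machine it is printed on.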

Solution 5

I use the *nix file command to detect the encoding of a file with an unknown charset, then convert the file to UTF-8:

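# -*- encoding: utf-8 -*-

# Convert a file of unknown encoding to UTF-8.

import codecs
import subprocess

file_location = "jumper.sub"

# Ask the *nix file command for the encoding name only (e.g. iso-8859-1).
# Passing the arguments as a list avoids shell-quoting problems in file names.
file_encoding = subprocess.check_output(
    ["file", "-b", "--mime-encoding", file_location]
).decode("ascii").strip()

file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location + "b", 'w', 'utf-8')

# Decode each line from the detected encoding and re-encode it as UTF-8.
for line in file_stream:
    file_output.write(line)

file_stream.close()
file_output.close()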

John Jiang

Updated on April 09, 2022

Comments

John Jiang about 2 years
I'm really confused with the codecs.open function. When I do:
```
file = codecs.open("temp", "w", "utf-8")
file.write(codecs.BOM_UTF8)
file.close()
```
It gives me the error

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

If I do:
```
file = open("temp", "w")
file.write(codecs.BOM_UTF8)
file.close()
```
It works fine.

Question is why does the first method fail? And how do I insert the bom?

If the second method is the correct way of doing it, what is the point of using codecs.open(filename, "w", "utf-8")?
- tchrist about 12 years
  
  Don’t use a BOM in UTF-8. Please.
- Salman von Abbas almost 11 years
  
  @tchrist Huh? Why not?
- Alois Mahdal over 10 years
  
  @SalmanPK BOM is not needed in UTF-8 and only adds complexity (e.g. you can't just concatenate BOM'd files and result with valid text). See this Q&A; don't miss the big comment under Q
Apache over 10 years

Warning: codecs.open and the built-in open are not the same. If you do "from codecs import open", the name open will NOT behave the same as the built-in open you get by simply typing "open".
Mohamad Fakih over 10 years

Thanks. That worked (Windows 7 x64, Python 2.7.5 x64). This solution works well when you open the file in mode "a" (append).
beta-closed over 7 years

you can also use codecs.open('test.txt', 'w', 'utf-8-sig') instead
show0k about 7 years

Use # coding: utf8 instead of # -*- coding: utf-8 -*-, which is far easier to remember.
Dustin Andrews over 6 years

This didn't work for me with Python 3 on Windows. I had to do this instead: with open(file_name, 'wb') as bomfile: bomfile.write(codecs.BOM_UTF8), then re-open the file for append.
Mugen about 6 years

I'm getting "TypeError: an integer is required (got type str)". I don't understand what we're doing here. Can someone please help? I need to append a string (paragraph) to a text file. Do I need to convert that into an integer first before writing?
Jon Skeet about 6 years

@Mugen: The exact code I've written works fine as far as I can see. I suggest you ask a new question showing exactly what code you've got, and where the error occurs.
northben almost 6 years

@Mugen you need to call codecs.open instead of just open
user2905353 over 4 years

Maybe add temp.close() ?
matheburg about 4 years

@user2905353: not required; this is handled by context management of open.
paradox almost 3 years

I am really interested in seeing something like that working on Windows