UTF-8 HTML and CSS files with BOM (and how to remove the BOM with Python)

python file utf-8 byte-order-mark

18,803

Solution 1

Since you state:

All of my (text) files are currently stored in UTF-8 with the BOM

then use the 'utf-8-sig' codec to decode them:

>>> s = u'Hello, world!'.encode('utf-8-sig')
>>> s
'\xef\xbb\xbfHello, world!'
>>> s.decode('utf-8-sig')
u'Hello, world!'

It automatically removes the expected BOM, and works correctly if the BOM is not present as well.

Solution 2

Check the first character after decoding to see if it's the BOM:

if u.startswith(u'\ufeff'):
  u = u[1:]

Solution 3

The previously-accepted answer is WRONG.

u'\ufffe' is not a character. If you get it in a unicode string somebody has stuffed up mightily.

The BOM (aka ZERO WIDTH NO-BREAK SPACE) is u'\ufeff'

>>> UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
>>> UNICODE_BOM
u'\ufeff'
>>>

Read this (Ctrl-F search for BOM) and this and this (Ctrl-F search for BOM).

Here's a correct and typo/braino-resistant answer:

Decode your input into unicode_str. Then do this:

# If I mistype the following, it's very likely to cause a SyntaxError.
UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
if unicode_str and unicode_str[0] == UNICODE_BOM:
    unicode_str = unicode_str[1:]

Bonus: using a named constant gives your readers a bit more of a clue to what is going on than does a collection of seemingly-arbitrary hexoglyphics.

Update Unfortunately there seems to be no suitable named constant in the standard Python library.

Alas, the codecs module provides only "a snare and a delusion":

>>> import pprint, codecs
>>> pprint.pprint([(k, getattr(codecs, k)) for k in dir(codecs) if k.startswith('BOM')])
[('BOM', '\xff\xfe'),   #### aarrgghh!! ####
 ('BOM32_BE', '\xfe\xff'),
 ('BOM32_LE', '\xff\xfe'),
 ('BOM64_BE', '\x00\x00\xfe\xff'),
 ('BOM64_LE', '\xff\xfe\x00\x00'),
 ('BOM_BE', '\xfe\xff'),
 ('BOM_LE', '\xff\xfe'),
 ('BOM_UTF16', '\xff\xfe'),
 ('BOM_UTF16_BE', '\xfe\xff'),
 ('BOM_UTF16_LE', '\xff\xfe'),
 ('BOM_UTF32', '\xff\xfe\x00\x00'),
 ('BOM_UTF32_BE', '\x00\x00\xfe\xff'),
 ('BOM_UTF32_LE', '\xff\xfe\x00\x00'),
 ('BOM_UTF8', '\xef\xbb\xbf')]
>>>

Update 2 If you have not yet decoded your input, and wish to check it for a BOM, you need to check for TWO different BOMs for UTF-16 and at least TWO different BOMs for UTF-32. If there was only one way each, then you wouldn't need a BOM, would you?

Here verbatim unprettified from my own code is my solution to this:

def check_for_bom(s):
    bom_info = (
        ('\xFF\xFE\x00\x00', 4, 'UTF-32LE'),
        ('\x00\x00\xFE\xFF', 4, 'UTF-32BE'),
        ('\xEF\xBB\xBF',     3, 'UTF-8'),
        ('\xFF\xFE',         2, 'UTF-16LE'),
        ('\xFE\xFF',         2, 'UTF-16BE'),
        )
    for sig, siglen, enc in bom_info:
        if s.startswith(sig):
            return enc, siglen
    return None, 0

The input s should be at least the first 4 bytes of your input. It returns the encoding that can be used to decode the post-BOM part of your input, plus the length of the BOM (if any).

If you are paranoid, you could allow for another 2 (non-standard) UTF-32 orderings, but Python doesn't supply an encoding for them and I've never heard of an actual occurrence, so I don't bother.

18,803

Author by

Cameron

I love programming :-)

Updated on June 18, 2022

Comments

Cameron almost 2 years

First, some background: I'm developing a web application using Python. All of my (text) files are currently stored in UTF-8 with the BOM. This includes all my HTML templates and CSS files. These resources are stored as binary data (BOM and all) in my DB.

When I retrieve the templates from the DB, I decode them using template.decode('utf-8'). When the HTML arrives in the browser, the BOM is present at the beginning of the HTTP response body. This generates a very interesting error in Chrome:

Extra <html> encountered. Migrating attributes back to the original <html> element and ignoring the tag.

Chrome seems to generate an <html> tag automatically when it sees the BOM and mistakes it for content, making the real <html> tag an error.

So, using Python, what is the best way to remove the BOM from my UTF-8 encoded templates (if it exists -- I can't guarantee this in the future)?

For other text-based files like CSS, will major browsers correctly interpret (or ignore) the BOM? They are being sent as plain binary data without .decode('utf-8').

Note: I am using Python 2.5.

Thanks!
Konrad Rudolph about 14 years

Hmm, don't you have to rewind the file after reading the first 4 bytes and before testing for BOMs? f.seek(0).
pajton about 14 years

@Konrad I missed that, thanks for pointing out. This is not production code anyway:].
Cameron about 14 years

Looks good to me (with the seek(0) fix), but I've already got the entire file in memory when I'm trying to chop the BOM -- how efficient is contents[2:] (for example) in Python? Does it create a copy of the entire string?
Cameron about 14 years

Will u'\ufffe' ever occur at the beginning of a non-UTF-8 file? Wouldn't the BOM take two "characters" (read: bytes) in my case (UTF-8)?
Ignacio Vazquez-Abrams about 14 years

u'\ufffe' may be found at the beginning of any UTF- or UCS-encoded file. The BOM is three bytes in UTF-8, but it's still a single Unicode codepoint.
Cameron about 14 years

OK, so just to get this straight, I'd need to first decode the byte-content of the file using u = contents.decode('utf-8') and then I'd be able to use your method because the BOM is a single codepoint. Correct?
Cameron about 14 years

I'd use this method if I was stripping the BOM while reading the file, but I'll be stripping the BOM with the file already in memory. Thanks for your reply though!
Ignacio Vazquez-Abrams about 14 years

@John: Calling getting the numbers mixed around "utterly wrong" is just slightly melodramatic, don't you think?
Cameron about 14 years

I fail to see how "ZERO WIDTH NO-BREAK SPACE", used here because it is also the BOM (pun intended), is any more legible than u"\uFEFF". They both require prior knowledge about the BOM to be understood.
Cameron about 14 years

@Ignacio: I still think this answer is the best for my circumstances, however I suggest you edit your answer to use u'\ufeff' instead. It seems to be the correct order (when using the Unicode codepoint -- the order of the actual encoded bytes depends, which is the whole point of the BOM).
John Machin about 14 years

This answer also has problems When reading a file, you need to check for FIVE (at least) possible BOMs, not three. See my answer.
John Machin about 14 years

@Cameron: The legibility comes from giving whatever constant you use a name e.g. UNICODE_BOM.
Vicki Laidler about 14 years

@Cameron: I know nothing about the BOM, but I have a sense what a "zero width no-break space" is, and no idea what a u"\uFEFF" is. The latter is also harder to be sure that I've typed correctly, since its 8 character-length consists of only 3 alphanumeric characters, two of which closely resemble each other.
Cameron about 14 years

@Vickie: In this context, the "zero width no-break space" is not being used to represent a zero width no-break space at all (it's purpose it completely different -- look up BOM if you're curious), which is why I find it equally unhelpful to use it by name instead of by codepoint. @John: You're right, it's a good idea to use a symbolic name (like a constant) instead of the codepoint directly.
John Machin about 14 years

@Cameron: The point of using the \N constant for ZWNBSP is that if you accidentally "mess up the order" you will get a SyntaxError immediately. The original purpose of ZWNBSP is now deprecated; "The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM". Unfortunately the unicodedata file doesn't include a mapping that would allow anything like u"\N{BYTE ORDER MARK}".
John Machin about 14 years

@Ignacio: The effect of "getting the numbers mixed around" was to produce not the BOM but the AntiBOM -- utterly wrong, just like confusing Christ and the Antichrist. Before mucking about with ordnance, it's a good idea to read the instructions carefully c.f. the Holy Hand Grenade of Antioch.
Cameron about 14 years

Ooh! Very nice! I'll try this as soon as I can.
Cameron about 14 years

Works beautifully (although Chrome mysteriously stopped giving the error no matter what, even with my old (wrong) code -- that's what I get for doing a whole bunch of changes at once).