u'\ufeff' in Python string


Solution 1

The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:

#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print 'utf-8     %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16    %r' % e16
print 'utf-16le  %r' % e16le
print 'utf-16be  %r' % e16be
print
print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')

Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8, but serves only as a signature (usually on Windows).

Output:

utf-8     'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
utf-16le  'A\x00B\x00C\x00'
utf-16be  '\x00A\x00B\x00C'

utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
utf-16 w/ BOM decoded with utf-16    u'ABC'          # *requires* BOM to be present.
utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.

Note that the utf-16 codec requires the BOM to be present, or Python won't know whether the data is big- or little-endian.
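Applied to the web-scraping case, here is a minimal Python 2 sketch (the raw_bytes value is just made-up sample data standing in for whatever your scraper downloaded): decoding the bytes with 'utf-8-sig' strips a leading BOM, while plain 'utf-8' leaves it in the result as u'\ufeff'.

#!python2
# Minimal sketch: raw_bytes stands in for the downloaded page content.
raw_bytes = '\xef\xbb\xbfsome scraped text'
print repr(raw_bytes.decode('utf-8'))      # u'\ufeffsome scraped text' (BOM kept)
print repr(raw_bytes.decode('utf-8-sig'))  # u'some scraped text' (BOM stripped)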

Solution 2

I ran into this on Python 3 and found this question (and solution). When opening a file, Python 3 supports the encoding keyword to automatically handle the encoding.

Without it, the BOM is included in the read result:

>>> f = open('file', mode='r')
>>> f.read()
'\ufefftest'

Giving the correct encoding, the BOM is omitted in the result:

>>> f = open('file', mode='r', encoding='utf-8-sig')
>>> f.read()
'test'

Just my 2 cents.

Solution 3

That character is the BOM or "Byte Order Mark". It is usually received as the first few bytes of a file, telling you how to interpret the encoding of the rest of the data. You can simply remove the character to continue. That said, since the error says you were trying to convert to 'ascii', you should probably pick another encoding for whatever you were trying to do.
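If you go the remove-it route, a minimal Python 2 sketch (the text value is just sample data): note the u'' prefix on the pattern, which is likely why a plain .replace('\ufeff', '') appeared not to work in the question.

#!python2
# Minimal sketch: text is assumed to already be a unicode string.
text = u'\ufeffsome scraped text'
cleaned = text.replace(u'\ufeff', u'')   # the u'' prefix matters in Python 2
print repr(cleaned)                      # u'some scraped text'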

Solution 4

The content you're scraping is encoded as Unicode rather than ASCII text, and you're getting a character that doesn't convert to ASCII. The right 'translation' depends on what encoding the original web page used. Python's Unicode documentation gives the background on how it works.

Are you trying to print the result or stick it in a file? The error suggests it's writing the data that's causing the problem, not reading it. This question is a good place to look for the fixes.
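If the failure really is on output, here is one hedged sketch of the 'pick another encoding' fix (the file name out.txt is a placeholder): encode the unicode string explicitly before writing, instead of letting Python fall back to the ascii codec.

#!python2
# Minimal sketch: encode explicitly instead of relying on implicit ascii.
text = u'\ufeffsome scraped text'
with open('out.txt', 'wb') as f:
    f.write(text.encode('utf-8'))    # no UnicodeEncodeError, bytes are UTF-8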

Comments

  • James Hallen
    James Hallen about 3 years

    I got an error with the following exception message:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in
    position 155: ordinal not in range(128)
    

    Not sure what u'\ufeff' is; it shows up when I'm web scraping. How can I remedy the situation? The .replace() string method doesn't work on it.

    • Admin
      Admin over 10 years
      Where does this input come from? What are you trying to do? Please include your Python code.
    • Doug Bradshaw
      Doug Bradshaw over 7 years
      Incidentally, I find that the .replace() does work in modern python if I remember the unicode indicator: s.replace(u'\ufeff','')
  • neurino
    neurino almost 5 years
    Thank you, this is the actual solution and should be the accepted answer. While this is great insight on why the string is there, most of the people coming here are looking for a straightforward solution, and this is it.
  • Luke
    Luke over 4 years
    Had this same issue with the csv DictReader reading a csv file saved from Excel.
  • osprey
    osprey almost 4 years
    Yes, Excel (even "csv" generated by Excel) really is a hot mess.
  • MachineLearner
    MachineLearner almost 3 years
    Did not help me to really solve this issue. If I open up the file with this method I still see the \ufeff.