Python 3 UnicodeDecodeError - How do I debug UnicodeDecodeError?

You have a corrupted data file. If that character really is meant to be a U+00AD SOFT HYPHEN, then you are missing a 0xC2 byte:

>>> '\u00ad'.encode('utf8')
b'\xc2\xad'

Of all the possible UTF-8 encodings that end in 0xAD, a soft hyphen does make the most sense. However, it is indicative of a data set that may have other bytes missing. You just happened to have hit one that matters.

I'd go back to the source of this dataset and verify that the file was not corrupted when downloaded. Otherwise, using errors='replace' is a viable work-around, provided no delimiters (tabs, newlines, etc.) are missing.
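To see what that work-around actually does to the offending byte, here is a minimal sketch. The sample bytes are copied from the hexdump in the question; nothing else about the file is assumed:

```python
# Sample bytes copied from the failing record; a lone 0xAD is not valid UTF-8.
sample = b"DISCLOSURE OF NON\xadCASH INVESTING"

# errors='replace' substitutes U+FFFD REPLACEMENT CHARACTER for the bad byte.
replaced = sample.decode("utf-8", errors="replace")
print(ascii(replaced))

# The original byte is not recoverable afterwards: re-encoding produces the
# three-byte UTF-8 form of U+FFFD, not 0xAD.
print(replaced.encode("utf-8"))
```

Valid ASCII such as tabs and newlines passes through unchanged, so the record structure survives, but whatever you store afterwards (in a database, for instance) contains U+FFFD rather than the character the publisher intended.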

Another possibility is that the SEC is really using a different encoding for the file; for example, in Windows codepage 1252 and Latin-1, 0xAD is the correct encoding of a soft hyphen. And indeed, when I download the same dataset directly (warning, large ZIP file linked) and open tag.txt, I can't decode the data as UTF-8:

>>> open('/tmp/2017q1/tag.txt', encoding='utf8').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
>>> from pprint import pprint
>>> f = open('/tmp/2017q1/tag.txt', 'rb')
>>> f.seek(3583550)
3583550
>>> pprint(f.read(100))
(b'1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING AND FINANCING A'
 b'CTIVITIES:\t\nProceedsFromSaleOfIn')
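The seek-and-read inspection above can also be driven by the exception itself, since UnicodeDecodeError carries the offending offset. A rough sketch follows; show_decode_error is a hypothetical helper name, and the sample bytes are taken from the output above:

```python
def show_decode_error(data, encoding="utf-8", context=30):
    """Return (offset, surrounding bytes) for the first decode error, or None.

    Hypothetical helper for illustration, not part of the original answer.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start / exc.end delimit the undecodable bytes within exc.object.
        lo = max(exc.start - context, 0)
        return exc.start, exc.object[lo:exc.end + context]
    return None


sample = b"1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING"
print(show_decode_error(sample, context=10))
```

For a real file you would pass the result of open(filename, 'rb').read(), which assumes the file fits in memory.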

There are two such non-ASCII characters in the file:

>>> f.seek(0)
0
>>> pprint([l for l in f if any(b > 127 for b in l)])
[b'SupplementalDisclosureOfNoncashInvestingAndFinancingActivitiesAbstract\t0'
 b'001654954-17-000551\t1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH I'
 b'NVESTING AND FINANCING ACTIVITIES:\t\n',
 b'HotelKranichhheMember\t0001558370-17-001446\t1\t0\tmember\tD\t\tHotel Krani'
 b'chhhe [Member]\tRepresents information pertaining to Hotel Kranichh\xf6h'
 b'e.\n']

Hotel Kranichh\xf6he decoded as Latin-1 is Hotel Kranichhöhe.
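As a quick check of that reading, the sketch below decodes both non-ASCII snippets as Latin-1 and as codepage 1252. The byte strings are copied from the records above:

```python
# Byte strings copied from the two non-ASCII records shown above.
soft_hyphen = b"NON\xadCASH"
hotel = b"Hotel Kranichh\xf6he"

for raw in (soft_hyphen, hotel):
    # Latin-1 maps every byte 0x00-0xFF to the code point with the same number,
    # so this decode can never fail; cp1252 agrees with it for 0xAD and 0xF6.
    as_latin1 = raw.decode("latin-1")
    as_cp1252 = raw.decode("cp1252")
    print(ascii(as_latin1), as_latin1 == as_cp1252)
```

Because Latin-1 accepts every byte value, a successful decode proves nothing by itself; it is the plausible result (a soft hyphen and an ö) that supports the guess.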

There are also several 0x1C / 0x1D pairs in the file:

>>> f.seek(0)
0
>>> quotes = [l for l in f if any(b in {0x1C, 0x1D} for b in l)]
>>> quotes[0].split(b'\t')[-1][50:130]
b'Temporary Payroll Tax Cut Continuation Act of 2011 (\x1cTCCA\x1d) recognized during th'
>>> quotes[1].split(b'\t')[-1][50:130]
b'ributory defined benefit pension plan (the \x1cAetna Pension Plan\x1d) to allow certai'

I'm betting those are really U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK characters; note the 1C and 1D parts. It almost feels as if their encoder took UTF-16 and stripped out all the high bytes, rather than encode to UTF-8 properly!

There is no codec shipping with Python that would encode '\u201C\u201D' to b'\x1C\x1D', making it all the more likely that the SEC has botched their encoding process somewhere. In fact, there are also 0x13 and 0x14 characters that are probably en and em dashes (U+2013 and U+2014), as well as 0x19 bytes that are almost certainly single quotes (U+2019). All that is missing to complete the picture is a 0x18 byte to represent U+2018.
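The "UTF-16 with the high bytes stripped" guess is easy to model. The sketch below is only a simulation of that hypothesis, not a claim about what the SEC's software does; strip_high_bytes is a made-up name:

```python
def strip_high_bytes(text):
    """Keep only the low byte of each UTF-16 code unit (hypothetical model of the bug)."""
    # In little-endian UTF-16 the low byte of each code unit comes first,
    # so taking every second byte discards the high bytes.
    return text.encode("utf-16-le")[::2]


print(strip_high_bytes("(\u201cTCCA\u201d)"))            # curly double quotes
print(strip_high_bytes("\u2013 \u2014 \u2018 \u2019"))    # dashes and single quotes
print(strip_high_bytes("Kranichh\u00f6he \u00ad"))        # characters below U+0100
```

Under this model, characters below U+0100 come out as their Latin-1 bytes, which would also explain the 0xAD and 0xF6 bytes, while the U+20xx punctuation collapses into control bytes such as 0x1C and 0x1D.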

If we assume that the encoding is broken, we can attempt to repair. The following code would read the file and fix the quotes issues, assuming that the rest of the data does not use characters outside of Latin-1 apart from the quotes:

_map = {
    # dashes
    0x13: '\u2013', 0x14: '\u2014',
    # single quotes
    0x18: '\u2018', 0x19: '\u2019',
    # double quotes
    0x1c: '\u201c', 0x1d: '\u201d',
}
def repair(line, _map=_map):
    """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1"""
    return line.translate(_map)

Then apply that to the lines you read:

def tags(filename):
    """Yield Tag instances from tag.txt, repairing mis-encoded characters."""
    with open(filename, 'r', encoding='latin-1') as f:
        repaired = map(repair, f)
        fields = next(repaired).strip().split('\t')
        for line in repaired:
            yield process_tag_record(fields, line)
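A quick check of the translation table on one of the records quoted earlier. The table is repeated so the snippet runs on its own, and the sample bytes are copied from the output above:

```python
# Same control-byte-to-punctuation table as above, repeated so this runs standalone.
_map = {
    0x13: "\u2013", 0x14: "\u2014",  # en dash, em dash
    0x18: "\u2018", 0x19: "\u2019",  # single quotes
    0x1c: "\u201c", 0x1d: "\u201d",  # double quotes
}

raw = b"Tax Cut Continuation Act of 2011 (\x1cTCCA\x1d) recognized"

# Decode as Latin-1 first (this never fails), then swap the stray control characters.
fixed = raw.decode("latin-1").translate(_map)
print(fixed)
```

One caveat if you process the text some other way: str.splitlines() treats \x1c and \x1d as line boundaries, so iterate over the file object or split on '\n' explicitly until the text has been repaired.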

Separately, addressing your posted code, you are making Python work harder than it needs to. Don't use codecs.open(); that's legacy code that has known issues and is slower than the newer Python 3 I/O layer. Just use open(). Do not use f.readlines(); you don't need to read the whole file into a list here. Just iterate over the file directly:

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        fields = next(f).strip().split('\t')
        for line in f:
            yield process_tag_record(fields, line)

If process_tag_record also splits on tabs, use a csv.reader() object and avoid splitting each row manually:

import csv

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict', newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        fields = next(reader)
        for row in reader:
            yield process_tag_record(fields, row)

If process_tag_record combines the fields list with the values in row to form a dictionary, just use csv.DictReader() instead:

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict', newline='') as f:
        reader = csv.DictReader(f, delimiter='\t')
        # first row is used as keys for the dictionary, no need to read fields manually.
        yield from reader
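
If the file does turn out to need the Latin-1 repair, the two ideas can be combined. The following is a sketch under that assumption; the column names and the in-memory sample are invented for illustration, and repaired_rows is a hypothetical name:

```python
import csv
import io

# Control-byte-to-punctuation table from the repair step, repeated for a standalone run.
_map = {
    0x13: "\u2013", 0x14: "\u2014",
    0x18: "\u2018", 0x19: "\u2019",
    0x1c: "\u201c", 0x1d: "\u201d",
}


def repaired_rows(f):
    """Yield one dict per record from a tab-separated file object opened as Latin-1."""
    # csv.DictReader accepts any iterable of lines, so repair each line lazily first.
    repaired = (line.translate(_map) for line in f)
    yield from csv.DictReader(repaired, delimiter="\t")


# Stand-in for open(filename, encoding='latin-1', newline=''); made-up sample data.
sample = io.BytesIO(b"tag\tdoc\nTCCA\t(\x1cTCCA\x1d) recognized\n")
f = io.TextIOWrapper(sample, encoding="latin-1", newline="")
rows = list(repaired_rows(f))
print(rows)
```

Repairing before parsing keeps the csv module unaware of the encoding problem, at the cost of one extra pass over each line.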

Author: MikeRand (Financial analyst)

Updated on August 02, 2022

Comments

  • MikeRand
    MikeRand over 1 year

    I have a text file which the publisher (the US Securities and Exchange Commission) asserts is encoded in UTF-8 (https://www.sec.gov/files/aqfs.pdf, section 4). I'm processing the lines with the following code:

    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with codecs.open(filename, 'r', encoding='utf-8', errors='strict') as f:
            fields = f.readline().strip().split('\t')
            for line in f.readlines():
                yield process_tag_record(fields, line)
    

    I receive the following error:

    Traceback (most recent call last):
      File "/home/randm/Projects/finance/secxbrl.py", line 151, in <module>
        main()
      File "/home/randm/Projects/finance/secxbrl.py", line 143, in main
        all_tags = list(tags("tag.txt"))
      File "/home/randm/Projects/finance/secxbrl.py", line 109, in tags
        content = f.read()
      File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 698, in read
        return self.reader.read(size)
      File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 501, in read
        newchars, decodedbytes = self.decode(data, self.errors)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
    

    Given that I probably can't go back to the SEC and tell them they have files that don't seem to be encoded in UTF-8, how should I debug and catch this error?

    What have I tried

    I did a hexdump of the file and found that the offending text was the text "SUPPLEMENTAL DISCLOSURE OF NON�CASH INVESTING". If I decode the offending byte as a hex code point (i.e. "U+00AD"), it makes sense in context as it is the soft hyphen. But the following does not seem to work:

    Python 3.5.2 (default, Nov 17 2016, 17:05:23)
    [GCC 5.4.0 20160609] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> b"\x41".decode("utf-8")
    'A'
    >>> b"\xad".decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 0: invalid start byte
    >>> b"\xc2ad".decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte
    

    I've used errors='replace', which seems to pass. But I'd like to understand what will happen if I try to insert that into a database.

    Hexdump:

    0036ae40  31 09 09 09 09 53 55 50  50 4c 45 4d 45 4e 54 41  |1....SUPPLEMENTA|
    0036ae50  4c 20 44 49 53 43 4c 4f  53 55 52 45 20 4f 46 20  |L DISCLOSURE OF |
    0036ae60  4e 4f 4e ad 43 41 53 48  20 49 4e 56 45 53 54 49  |NON.CASH INVESTI|
    0036ae70  4e 47 20 41 4e 44 20 46  49 4e 41 4e 43 49 4e 47  |NG AND FINANCING|
    0036ae80  20 41 43 54 49 56 49 54  49 45 53 3a 09 0a 50 72  | ACTIVITIES:..Pr|
    
    • Martijn Pieters
      Martijn Pieters about 6 years
      In Python 3.6, do not use codecs.open(). The standard open() function can handle encoded data better and faster.
    • Martijn Pieters
      Martijn Pieters about 6 years
      @HåkenLid: except there is no known encoding that can produce the output the SEC produced. They have produced an invalid codec.
  • MikeRand
    MikeRand about 6 years
    I have to do a bit more work in process_tag_record than just zipping and returning (e.g. converting data to Python data types, creating a SQLAlchemy instance), but yes, that would work better if it were just a zip and return.
  • Mark Tolonen
    Mark Tolonen about 6 years
    Per your "UTF-16 with high bytes stripped", that's exactly what it looks like. There are also single quotes and em and en dashes that follow the same pattern.
  • Martijn Pieters
    Martijn Pieters almost 5 years
    @tripleee: interesting. I generally use the fileformat.info characterset and unicode pages to cross-reference characters; they have comprehensive, per-codepoint listings of character sets and Windows codepages to check against. And for other broken encoding problems the ftfy project is invaluable.
  • Martijn Pieters
    Martijn Pieters almost 5 years
    @tripleee: and I only now noticed that your links go to fileformat.info :-D
  • Martijn Pieters
    Martijn Pieters over 4 years
    @tripleee: thanks again for that page, it was helpful in finding a weird codec once more.
  • tripleee
    tripleee over 2 years
    I can't fix the broken link, but I can point to the new location: tripleee.github.io/8bit