xml.etree.ElementTree.ParseError: not well-formed (invalid token)


The error message from the W3Schools validator is misleading. The problem with 0x0c is not that it is invalid UTF-8; it is that it is not a legal character in XML, so passing encoding='utf-8' to the parser does not help.
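To illustrate the distinction, here is a minimal sketch. The tiny `doc` document is a made-up stand-in for the feed, not its real content. The byte decodes cleanly as UTF-8, yet the parser still rejects a document that contains it.

```python
import xml.etree.ElementTree as ET

form_feed = b"\x0c"

# Decoding succeeds: 0x0C is a valid one-byte UTF-8 sequence (U+000C).
decoded = form_feed.decode("utf-8")

# Hypothetical miniature document containing the form feed.
doc = b"<rss><title>a" + form_feed + b"b</title></rss>"

# Parsing fails anyway, because U+000C is not a legal XML 1.0 character.
try:
    ET.fromstring(doc)
    parsed = True
    error = ""
except ET.ParseError as exc:
    parsed = False
    error = str(exc)

print(parsed, error)
```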

0x0c is the form feed control character, which XML 1.0 does not allow anywhere in a document, and it carries no useful content in a feed. Conforming XML parsers are obliged to reject documents that are not well-formed, and you cannot change the RSS feed, so the simplest solution is to remove the character from the response before parsing it.

>>> tree = ET.fromstring(original_response, ET.XMLParser(encoding='utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 1315, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 185, column 1106

>>> fixed = original_response.replace(b'\x0c', b'')
>>> tree = ET.fromstring(fixed, ET.XMLParser(encoding='utf-8'))
>>> tree
<Element 'rss' at 0x7ff316db6278>
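If other stray control characters might turn up in the feed, the same idea can be generalised. The following is only a sketch under stated assumptions: the helper name `strip_illegal_xml_bytes` and the `sample` document are hypothetical, and the input is assumed to be UTF-8 (or another ASCII-compatible encoding), where bytes below 0x20 never occur inside a multi-byte sequence. It removes every C0 control byte except tab, line feed and carriage return, which are the only ones XML 1.0 permits. It does not handle the other characters XML 1.0 excludes, such as U+FFFE and U+FFFF.

```python
import re
import xml.etree.ElementTree as ET

# C0 control bytes that XML 1.0 forbids: everything below 0x20 except
# tab (0x09), line feed (0x0A) and carriage return (0x0D).
_ILLEGAL_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_xml_bytes(data: bytes) -> bytes:
    """Drop control bytes that XML 1.0 forbids (assumes UTF-8 input)."""
    return _ILLEGAL_XML_BYTES.sub(b"", data)


# Hypothetical stand-in for the feed body, with a form feed inside a title.
sample = b"<rss><item><title>con\x0cfig</title></item></rss>"

root = ET.fromstring(strip_illegal_xml_bytes(sample))
title = root.find("item/title").text
print(title)
```

For this particular feed, the single `replace(b'\x0c', b'')` call shown above is enough; the regular expression is only worth it if the offending characters vary.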
Author: dataviews

Updated on December 05, 2022

Comments

  • dataviews
    dataviews over 1 year

    Using Python 3

    Error we get:

    File "C:/scratch.py", line 27, in run
        tree = ET.fromstring(responses[0].decode(), ET.XMLParser(encoding='utf-8'))
      File "C:\Programs\Python\Python36-32\lib\xml\etree\ElementTree.py", line 1314, in XML
        parser.feed(text)
    xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 163, column 1106
    

    Our code:

    import xml.etree.ElementTree as ET

    tree = ET.fromstring(responses[0].decode(), ET.XMLParser(encoding='utf-8'))
    for i in tree.iter('item'):
        try:
            title = i.find('title').text
        except Exception:
            pass
    

    responses is a list of the responses returned by a series of URL GET requests. In this case we are testing only index 0, which is the response for one specific URL: http://feeds.feedburner.com/marginalrevolution/feed

    We pasted the XML into the W3Schools validator and got:

    This page contains the following errors:
    error on line 163 at column 31: Input is not in proper UTF-8, indicate encoding! Bytes: 0x0C 0x66 0x69 0x67
    

    But shouldn't passing ET.XMLParser(encoding='utf-8') as the parser fix this error when parsing?

  • dataviews
    dataviews almost 6 years
    Worked like a charm! Thank you for a good explanation as well!