xml.etree.ElementTree.ParseError: not well-formed (invalid token)

python python-3.x xml-parsing

11,384

The error message from the W3 Schools validator is misleading. The problem with 0x0c is not that it is invalid UTF-8 but that it is not a legal character in XML.
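Both halves of that claim can be checked directly. This is a minimal sketch: the `doc` sample is made up for illustration and is not taken from the feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document containing a form feed (0x0c).
doc = b'<title>con\x0cfig</title>'

# The bytes decode cleanly, so the input is valid UTF-8.
decoded = doc.decode('utf-8')

# The XML parser still rejects the document, because form feed is
# outside the set of characters that XML 1.0 allows.
try:
    ET.fromstring(doc, ET.XMLParser(encoding='utf-8'))
    rejected = False
except ET.ParseError:
    rejected = True
```

The `encoding` argument only tells the parser how to decode the bytes; it does not relax the well-formedness rules.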

0x0c is the form feed control character, so its presence in the document isn't useful. Conforming XML parsers are obliged to reject documents that are not well-formed, and you cannot change the RSS feed, so the simplest solution is to remove the character from the document before processing.

>>> tree = ET.fromstring(original_response, ET.XMLParser(encoding='utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 1315, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 185, column 1106

>>> fixed = original_response.replace(b'\x0c', b'')
>>> tree = ET.fromstring(fixed, ET.XMLParser(encoding='utf-8'))
>>> tree
<Element 'rss' at 0x7ff316db6278>
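If other control characters might turn up in the feed, the same idea can be generalised. The sketch below strips every ASCII control byte that XML 1.0 forbids, not just 0x0c. The helper name `strip_illegal_xml_bytes` and the `sample` document are hypothetical, and the approach assumes the input is UTF-8 (or another ASCII-compatible encoding), where bytes below 0x80 never occur inside a multi-byte sequence.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 forbids: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_ILLEGAL_XML_BYTES = re.compile(rb'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def strip_illegal_xml_bytes(data: bytes) -> bytes:
    """Return data with the forbidden ASCII control bytes removed."""
    return _ILLEGAL_XML_BYTES.sub(b'', data)


# Hypothetical stand-in for the downloaded feed.
sample = (b'<rss><channel><item><title>con\x0cfig</title></item>'
          b'</channel></rss>')

tree = ET.fromstring(strip_illegal_xml_bytes(sample))
titles = [item.findtext('title') for item in tree.iter('item')]
```

Note that this silently drops the offending bytes, so the parsed text differs slightly from what the feed actually sent.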


Author by

dataviews

Updated on December 05, 2022

Comments

dataviews over 1 year

Using Python 3

Error we get:

  File "C:/scratch.py", line 27, in run
    tree = ET.fromstring(responses[0].decode(), ET.XMLParser(encoding='utf-8'))
  File "C:\Programs\Python\Python36-32\lib\xml\etree\ElementTree.py", line 1314, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 163, column 1106

Our code:

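import xml.etree.ElementTree as ET

# responses is the list of downloaded feed bodies (bytes)
tree = ET.fromstring(responses[0].decode(), ET.XMLParser(encoding='utf-8'))
for i in tree.iter('item'):
    try:
        title = i.find('title').text
    except Exception:
        # skip items that have no <title> element
        pass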

responses is a list of the responses returned from URL GET requests. Index 0 here is the response for the one specific URL we are testing: http://feeds.feedburner.com/marginalrevolution/feed

We plugged the XML into the W3 Schools validator and got:

This page contains the following errors:
error on line 163 at column 31: Input is not in proper UTF-8, indicate encoding! Bytes: 0x0C 0x66 0x69 0x67

Shouldn't passing ET.XMLParser(encoding='utf-8') fix this error when parsing?

dataviews almost 6 years

Worked like a charm! Thank you for a good explanation as well!
