XML parsing - ElementTree vs SAX and DOM


Solution 1

ElementTree is much easier to use, because it represents an XML tree (basically) as a structure of lists, and attributes are represented as dictionaries.

ElementTree needs much less memory for XML trees than DOM (and thus is faster), and the parsing overhead via iterparse is comparable to SAX. Additionally, iterparse returns partial structures, and you can keep memory usage constant during parsing by discarding the structures as soon as you process them.
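As a rough illustration of the constant-memory pattern described above, here is a minimal sketch written for Python 3. The sample data and the helper name iter_titles are assumptions made up for this example; a real use case would pass a file name or file object for a large document.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; in practice this would be a large file on disk.
XML_DATA = b"""<catalog>
    <book isdn="xxx-1"><author>A1</author><title>T1</title></book>
    <book isdn="xxx-2"><author>A2</author><title>T2</title></book>
</catalog>"""


def iter_titles(source):
    """Yield (isdn, title) pairs, discarding each <book> once it is processed."""
    # An "end" event fires once an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "book":
            yield elem.get("isdn"), elem.findtext("title")
            # Drop the children, text and attributes of the processed element.
            elem.clear()


titles = list(iter_titles(io.BytesIO(XML_DATA)))
print(titles)
```

Note that the cleared elements stay attached to the root as empty shells, so for very large inputs it may also be worth removing them from their parent.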

ElementTree, as in Python 2.5, has only a small feature set compared to full-blown XML libraries, but it's enough for many applications. If you need a validating parser or complete XPath support, lxml is the way to go. For a long time, it used to be quite unstable, but I haven't had any problems with it since 2.1.
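For reference, later ElementTree versions (the one shipped with Python 2.7 and 3.x) do understand a limited XPath subset in find() and findall(). The sketch below assumes such a version and uses made-up catalog data; anything beyond simple paths and predicates still calls for lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog used only for this illustration.
root = ET.fromstring("""<catalog>
    <book isdn="xxx-1"><author>A1</author><title>T1</title></book>
    <book isdn="xxx-2"><author>A2</author><title>T2</title></book>
</catalog>""")

# An attribute predicate followed by a child step.
title = root.find("./book[@isdn='xxx-2']/title")
print(title.text)

# A descendant search with ".//".
authors = [a.text for a in root.findall(".//author")]
print(authors)
```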

ElementTree deviates from DOM, where nodes have access to their parent and siblings. Handling actual documents rather than data stores is also a bit cumbersome, because text nodes aren't treated as actual nodes. In the XML snippet

<a>This is <b>a</b> test</a>

the string " test" will be the so-called tail of element b.
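The text/tail split can be checked directly. The following is a small sketch for Python 3 using the snippet above; the parent_of dictionary is just one common workaround for the missing parent links, not something ElementTree provides itself.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<a>This is <b>a</b> test</a>")
b = a.find("b")

print(repr(a.text))  # text before the first child of <a>
print(repr(b.text))  # text inside <b>
print(repr(b.tail))  # text after </b>, stored on <b> rather than on <a>

# Reassemble the complete character content of <a>.
full_text = "".join(a.itertext())

# ElementTree keeps no parent pointers; a child-to-parent map is one workaround.
parent_of = {child: parent for parent in a.iter() for child in parent}
```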

In general, I recommend ElementTree as the default for all XML processing with Python, and DOM or SAX as the solutions for specific problems.

Solution 2

Minimal DOM implementation:


Python supplies the W3C-standard XML DOM API definition (xml.dom) and a minimal implementation of it, xml.dom.minidom, which is simpler and smaller than a full DOM implementation. However, from a "parsing perspective", it has all the pros and cons of the standard DOM, i.e. it loads everything into memory.

Considering a basic XML file:

<?xml version="1.0"?>
<catalog>
    <book isdn="xxx-1">
      <author>A1</author>
      <title>T1</title>
    </book>
    <book isdn="xxx-2">
      <author>A2</author>
      <title>T2</title>
    </book>
</catalog>

A possible Python parser using minidom is:

import os
from xml.dom import minidom
from xml.parsers.expat import ExpatError

#-------- Select the XML file: --------#
#Current file name and directory:
curpath = os.path.dirname( os.path.realpath(__file__) )
filename = os.path.join(curpath, "sample.xml")
#print "Filename: %s" % (filename)

#-------- Parse the XML file: --------#
try:
    #Parse the given XML file:
    xmldoc = minidom.parse(filename)
except ExpatError as e:
    print "[XML] Error (line %d): %d" % (e.lineno, e.code)
    print "[XML] Offset: %d" % (e.offset)
    raise e
except IOError as e:
    print "[IO] I/O Error %d: %s" % (e.errno, e.strerror)
    raise e
else:
    catalog = xmldoc.documentElement
    books = catalog.getElementsByTagName("book")

    for book in books:
        print book.getAttribute('isdn')
        print book.getElementsByTagName('author')[0].firstChild.data
        print book.getElementsByTagName('title')[0].firstChild.data

Note that xml.parsers.expat is a Python interface to the Expat non-validating XML parser (docs.python.org/2/library/pyexpat.html).

The xml.dom package also supplies the exception class DOMException, but it is not supported in minidom!

The ElementTree XML API:


ElementTree is much easier to use and it requires less memory than XML DOM. Furthermore, a C implementation is available (xml.etree.cElementTree).

A possible Python parser using ElementTree is:

import os
from xml.etree import cElementTree  # C implementation of xml.etree.ElementTree
from xml.parsers.expat import ExpatError  # XML formatting errors

#-------- Select the XML file: --------#
#Current file name and directory:
curpath = os.path.dirname( os.path.realpath(__file__) )
filename = os.path.join(curpath, "sample.xml")
#print "Filename: %s" % (filename)

#-------- Parse the XML file: --------#
try:
    #Parse the given XML file:
    tree = cElementTree.parse(filename)
except ExpatError as e:
    print "[XML] Error (line %d): %d" % (e.lineno, e.code)
    print "[XML] Offset: %d" % (e.offset)
    raise e
except IOError as e:
    print "[IO] I/O Error %d: %s" % (e.errno, e.strerror)
    raise e
else:
    catalogue = tree.getroot()

    for book in catalogue:
        print book.attrib.get("isdn")
        print book.find('author').text
        print book.find('title').text
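Since cElementTree is no longer the recommended import on current Python versions, a Python 3 variant might look like the sketch below. It is only an assumption of how the example could be adapted: it parses an in-memory string instead of sample.xml, the helper name read_books is made up, and it catches ElementTree's own ParseError.

```python
import xml.etree.ElementTree as ET

# Same catalog as above, inlined so the sketch is self-contained.
SAMPLE = """<?xml version="1.0"?>
<catalog>
    <book isdn="xxx-1"><author>A1</author><title>T1</title></book>
    <book isdn="xxx-2"><author>A2</author><title>T2</title></book>
</catalog>"""


def read_books(xml_text):
    """Return a list of (isdn, author, title) tuples from a catalog string."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as e:
        # e.position is a (line, column) tuple; e.code is the expat error code.
        line, column = e.position
        print("[XML] Error %d at line %d, column %d" % (e.code, line, column))
        raise  # a bare raise keeps the original traceback
    return [(b.get("isdn"), b.findtext("author"), b.findtext("title"))
            for b in root.findall("book")]


books = read_books(SAMPLE)
print(books)
```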

Solution 3

ElementTree has a more Pythonic API. It is also in the standard library now, so using it reduces dependencies.

I actually prefer lxml, as it has an API like ElementTree's but also offers nice additional features and performs well.

Solution 4

ElementTree's parse() is like DOM, whereas iterparse() is like SAX. In my opinion, ElementTree is better than both DOM and SAX in that it provides an API that is easier to work with.
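To make the comparison concrete, here is a minimal sketch for Python 3 that extracts the same titles three ways. The data and the TitleHandler class are invented for illustration only.

```python
import io
import xml.sax
import xml.etree.ElementTree as ET

# Hypothetical input used for all three approaches.
XML_DATA = (b"<catalog><book><title>T1</title></book>"
            b"<book><title>T2</title></book></catalog>")


class TitleHandler(xml.sax.ContentHandler):
    """SAX: the handler has to track state across callbacks itself."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        # Text may arrive in several chunks, so append rather than assign.
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "title":
            self.in_title = False


handler = TitleHandler()
xml.sax.parseString(XML_DATA, handler)
sax_titles = handler.titles

# iterparse(): also event-driven, but each event carries a finished element.
iter_titles = [elem.text
               for _, elem in ET.iterparse(io.BytesIO(XML_DATA))
               if elem.tag == "title"]

# parse(): DOM-like, the whole tree is built before it is queried.
tree = ET.parse(io.BytesIO(XML_DATA))
tree_titles = [t.text for t in tree.getroot().iter("title")]

print(sax_titles, iter_titles, tree_titles)
```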

Author: Corey Goldberg

Updated on July 05, 2022

Comments

  • Corey Goldberg almost 2 years

    Python has several ways to parse XML...

    I understand the very basics of parsing with SAX. It functions as a stream parser, with an event-driven API.

    I understand the DOM parser also. It reads the XML into memory and converts it to objects that can be accessed with Python.

    Generally speaking, it was easy to choose between the two depending on what you needed to do, memory constraints, performance, etc.

    (Hopefully I'm correct so far.)

    Since Python 2.5, we also have ElementTree. How does this compare to DOM and SAX? Which is it more similar to? Why is it better than the previous parsers?

  • user1066101 over 15 years
    Also, I find that I want the real structure, not a series of events.
  • giltay over 15 years
    A serial parser is often good enough for simple parsing. I started Python using sax, and only switched to minidom when my needs became too complex for sax. I should add that I haven't used ElementTree, yet, since it doesn't seem to offer enough more functionality for me to port my code to it.
  • Jon Coombs over 10 years
    Thank you for mentioning both of the following caveats! (I happen to need both in my project.) "XPath support ... ElementTree deviates from DOM, where nodes have access to their parent and siblings."
  • Jon Coombs over 10 years
    Thanks! Very helpful. I'm not sure enough to edit it, but I think (a) the else is not helpful since there's no finally: stackoverflow.com/questions/855759/python-try-else; (b) a plain raise would preserve more than raise e : stackoverflow.com/questions/11420464/…
  • Paolo Rovelli over 10 years
    Regarding point (a), yes. There is no finally statement simply because, in my example, there was no need. I do not remember why I did put it. However, even if useless in this case, having the else statement is not syntactically wrong.
  • Paolo Rovelli over 10 years
    Regarding point (b), it might be so. However, I think (in my example) this is a little bit out of scope. Indeed, the code was meant to be just a simple example of XML parsing...
  • Jon Coombs over 10 years
    Oh, I didn't mean that any of it was 'wrong'; just some suggested edits for the sake of others who might come along and copy/paste.
  • gerrit almost 9 years
    Note that the cElementTree is deprecated as of Python 3.3
  • TextGeek over 4 years
    ElementTree also has the problem that text content is treated as a property of the preceding Node, not as a Node of its own. So in "<p>The <i>brown</i> dog</p>", the <p> element has 1 child, not 3. "The" is a property on the <p>, and "dog" is a property on the <i> (not even the same property -- there might be text right after the <p> ends, too). Nearly everything about trees and contexts and paths and searching works way differently than you're used to if you use HTML, CSS, or most anything else to do with documents.
  • S.B almost 3 years
    @JonCoombs ElementTree supports xpath as said in docs