ElementTree iterparse strategy


Solution 1

Here's one possible approach: maintain a list of the currently open tags and look back through it to find the parent node(s).

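import xml.etree.ElementTree as ET

file_path = "test.xml"  # same file name as in the question

path = []  # tags of the currently open elements, outermost first
for event, elem in ET.iterparse(file_path, events=("start", "end")):
    if event == 'start':
        path.append(elem.tag)
    elif event == 'end':
        # process the tag; path still ends with this element's own tag here
        if elem.tag == 'name':
            if 'members' in path:
                print('member')
            else:
                print('nonmember')
        path.pop()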
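
As a rough, self-contained sketch of the same path-list idea, the version below reads the text on the "end" event (the ElementTree documentation notes that an element's text is not guaranteed to be populated at "start") and looks only at the direct parent instead of the whole path. The names SAMPLE and classify_names are made up for illustration, and the sample data is a shortened copy of the XML from the question.

```python
import io
import xml.etree.ElementTree as ET

# Shortened sample based on the question's XML (illustrative only).
SAMPLE = b"""<families>
  <family>
    <name>Simpson</name>
    <members>
      <name>Homer</name>
      <name>Marge</name>
    </members>
  </family>
</families>"""


def classify_names(source):
    """Yield (kind, text) for every <name>, using a stack of open tags."""
    path = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue
        # On "end", path[-1] is this element and path[-2] is its parent.
        if elem.tag == "name":
            parent = path[-2] if len(path) > 1 else None
            kind = "member" if parent == "members" else "family"
            yield kind, (elem.text or "").strip()
        path.pop()
        if elem.tag == "family":
            # Drop the finished family's children and text to save memory.
            elem.clear()


results = list(classify_names(io.BytesIO(SAMPLE)))
print(results)
```

Checking only path[-2] keeps the test precise when a tag such as name can appear at several depths; clearing each finished family is one way to keep memory use bounded on very large files.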

Solution 2

pulldom is excellent for this. You get a SAX-style event stream. You can iterate through the stream and, when you find a node that you are interested in, expand that node into a DOM fragment.

import xml.dom.pulldom as pulldom
import xpath # from http://code.google.com/p/py-dom-xpath/

events = pulldom.parse('families.xml')
for event, node in events:
    if event == 'START_ELEMENT' and node.tagName=='family':
        events.expandNode(node) # node now contains a dom fragment
        family_name = xpath.findvalue('name', node)
        members = xpath.findvalues('members/name', node)
        print('family name: {0}, members: {1}'.format(family_name, members))

output:

family name: Simpson, members: [u'Homer', u'Marge', u'Bart']
family name: Griffin, members: [u'Peter', u'Brian', u'Meg']
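
If installing the third-party xpath module is not an option, a similar result can probably be reached with the standard library alone. The sketch below uses only xml.dom.pulldom and plain DOM navigation; the helpers child_elements, text_of and read_families are hypothetical names introduced here, and the sample is a compact copy of the question's XML.

```python
from xml.dom import pulldom

# Compact copy of the question's XML (illustrative only).
SAMPLE = """<families>
  <family>
    <name>Simpson</name>
    <members><name>Homer</name><name>Marge</name><name>Bart</name></members>
  </family>
  <family>
    <name>Griffin</name>
    <members><name>Peter</name><name>Brian</name><name>Meg</name></members>
  </family>
</families>"""


def child_elements(node, tag):
    """Return the direct child elements of node that have the given tag."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == tag]


def text_of(node):
    """Join the direct text nodes of an element and strip whitespace."""
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE).strip()


def read_families(xml_text):
    """Return a list of (family_name, [member_names]) tuples."""
    events = pulldom.parseString(xml_text)
    families = []
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "family":
            events.expandNode(node)  # node now holds the whole <family> subtree
            family_name = text_of(child_elements(node, "name")[0])
            members = [text_of(n)
                       for group in child_elements(node, "members")
                       for n in child_elements(group, "name")]
            families.append((family_name, members))
    return families


families = read_families(SAMPLE)
for family_name, members in families:
    print("family name: {0}, members: {1}".format(family_name, members))
```

Only one family subtree is expanded into a DOM fragment at a time, so the approach should stay usable on large files as long as each individual family is small.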
Author: Juan Antonio Gomez Moriano

Software developer, making my way into the machine learning world

Updated on July 17, 2022

Comments

  • Juan Antonio Gomez Moriano
    Juan Antonio Gomez Moriano almost 2 years

    I have to handle large XML documents (up to 1 GB) and parse them with Python. I am using the iterparse() function (SAX-style parsing).

    My concern is the following. Imagine you have an XML document like this:

    <?xml version="1.0" encoding="UTF-8" ?>
    <families>
      <family>
        <name>Simpson</name>
        <members>
            <name>Homer</name>
            <name>Marge</name>
            <name>Bart</name>
        </members>
      </family>
      <family>
        <name>Griffin</name>
        <members>
            <name>Peter</name>
            <name>Brian</name>
            <name>Meg</name>
        </members>
      </family>
    </families>
    

    The problem, of course, is knowing when I am getting a family name (such as Simpson) and when I am getting the name of a member of that family (for example, Homer).

    What I have been doing so far is using "switches" that tell me whether I am inside a "members" tag or not. The code looks like this:

    import xml.etree.cElementTree as ET
    
    __author__ = 'moriano'
    
    file_path = "test.xml"
    context = ET.iterparse(file_path, events=("start", "end"))
    
    # turn it into an iterator
    context = iter(context)
    on_members_tag = False
    for event, elem in context:
        tag = elem.tag
        value = elem.text
        if value :
            value = value.encode('utf-8').strip()
    
        if event == 'start' :
            if tag == "members" :
                on_members_tag = True
    
            elif tag == 'name' :
                if on_members_tag :
                    print "The member of the family is %s" % value
                else :
                    print "The family is %s " % value
    
        if event == 'end' and tag =='members' :
            on_members_tag = False
        elem.clear()
    

    And this works fine, as the output is:

    The family is Simpson 
    The member of the family is Homer
    The member of the family is Marge
    The member of the family is Bart
    The family is Griffin 
    The member of the family is Peter
    The member of the family is Brian
    The member of the family is Meg
    

    My concern is that even with this (simple) example I had to create an extra variable (on_members_tag) to know which tag I was in. The real XML documents that I have to handle have many more nested tags.

    Also note that this is a very reduced example, so you can assume that I may be facing XML with more tags, more inner tags, and a need to extract different tag names, attributes and so on.

    So the question is: am I doing something horribly stupid here? I feel like there must be a more elegant solution to this.
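
For the deeper nesting the question anticipates, one option that avoids a separate flag per tag is to key small handlers on the full path of open tags. The following is only a sketch under that assumption: HANDLERS, run_handlers and the sample data are hypothetical names and values chosen for illustration, and the messages mirror the output shown in the question.

```python
import io
import xml.etree.ElementTree as ET

# Shortened sample based on the question's XML (illustrative only).
SAMPLE = b"""<families>
  <family>
    <name>Simpson</name>
    <members>
      <name>Homer</name>
      <name>Bart</name>
    </members>
  </family>
</families>"""

# Map a full tag path to a function that turns the element into a message.
HANDLERS = {
    ("families", "family", "name"):
        lambda elem: "The family is %s" % elem.text.strip(),
    ("families", "family", "members", "name"):
        lambda elem: "The member of the family is %s" % elem.text.strip(),
}


def run_handlers(source, handlers):
    """Call the handler registered for each element's path on its 'end' event."""
    path = []
    messages = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue
        handler = handlers.get(tuple(path))
        if handler is not None:
            messages.append(handler(elem))
        path.pop()
    return messages


lines = run_handlers(io.BytesIO(SAMPLE), HANDLERS)
for line in lines:
    print(line)
```

Supporting a new tag then means adding one dictionary entry rather than another boolean switch, at the cost of spelling out each path explicitly.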