LXML - Sorting Tag Order

10,634

Solution 1

It's possible to write a helper function to insert a new element in the correct place, but without knowing more about the structure it's difficult to make it generic.
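As a rough sketch of such a helper, under some stated assumptions: sibling tags sort correctly as plain strings (true for zero-padded names such as A01 to A99), the parent's existing children are already in order, and none of them is a comment. The name insert_sorted is made up for illustration, and the import falls back to the standard library's ElementTree, which supports the same calls used here.

```python
from bisect import bisect_right

try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree is an acceptable stand-in
    import xml.etree.ElementTree as etree


def insert_sorted(parent, child):
    """Insert child so that parent's children stay ordered by tag.

    Hypothetical helper: assumes the existing children are already sorted
    and that none of them is a comment or processing instruction.
    """
    tags = [sibling.tag for sibling in parent]
    # bisect_right places the new element after any siblings with an equal tag
    parent.insert(bisect_right(tags, child.tag), child)


record = etree.fromstring("<A><A01>X</A01><A03>Z</A03></A>")
new_field = etree.Element("A02")
new_field.text = "Y"
insert_sorted(record, new_field)
print([field.tag for field in record])
```

This keeps each insertion cheap, but it only helps if every new tag is added through the helper; sorting once before export avoids that constraint.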

Here's a short example of sorting child elements across the whole document:

from lxml import etree

data = """<X>
    <X03>3</X03>
    <X02>2</X02>
    <A>
        <A02>Y</A02>
        <A01>X</A01>
        <A03>Z</A03>
    </A>
    <X01>1</X01>
    <B>
        <B01>Z</B01>
        <B02>X</B02>
        <B03>C</B03>
    </B>
</X>"""

doc = etree.XML(data, etree.XMLParser(remove_blank_text=True))

for parent in doc.xpath('//*[./*]'):  # every element that has child elements
    parent[:] = sorted(parent, key=lambda x: x.tag)

# tostring() returns bytes on Python 3, so decode before printing
print(etree.tostring(doc, pretty_print=True).decode())

Yielding:

<X>
  <A>
    <A01>X</A01>
    <A02>Y</A02>
    <A03>Z</A03>
  </A>
  <B>
    <B01>Z</B01>
    <B02>X</B02>
    <B03>C</B03>
  </B>
  <X01>1</X01>
  <X02>2</X02>
  <X03>3</X03>
</X>
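
The slice assignment parent[:] = ... is what makes the loop above change the tree: it replaces the children of the existing element, whereas a plain parent = sorted(...) would only rebind the name to a new list. A small sketch of the difference follows (the fallback import to the standard library is an assumption, made only to keep the snippet runnable without lxml):

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree behaves the same here
    import xml.etree.ElementTree as etree

root = etree.fromstring("<A><A02>Y</A02><A01>X</A01></A>")

# Rebinding: 'parent' now names a new list; the element itself is untouched.
parent = root
parent = sorted(parent, key=lambda child: child.tag)
order_after_rebinding = [child.tag for child in root]

# Slice assignment: the element's own children are replaced in place.
root[:] = sorted(root, key=lambda child: child.tag)
order_after_slice_assignment = [child.tag for child in root]

print(order_after_rebinding)
print(order_after_slice_assignment)
```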

Solution 2

You can sort your XML elements like this:

from operator import attrgetter
from lxml import etree

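# parse() returns an ElementTree; getroot() gives the element whose children we sort
root = etree.parse(xmlfile).getroot()  # xmlfile: path to your XML document
children = list(root)
sorted_list = sorted(children, key=attrgetter('tag'))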

If this runs too slowly, you could sort just the tag names and then fetch the nodes using XPath:

tag_list = [item.tag for item in root]
sorted_taglist = sorted(tag_list)
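
The snippet above stops at the sorted tag names. One possible way to finish the idea is sketched below, using findall in place of the XPath lookup the answer mentions. This is an assumption about what was intended: it only reorders the root's direct children and assumes there are no comments among them. Whether it is actually faster than sorting the elements directly would need measuring.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree is an acceptable stand-in
    import xml.etree.ElementTree as etree

root = etree.fromstring(
    "<X><X03>3</X03><X01>1</X01><X02>2</X02><X01>1b</X01></X>"
)

# Sort the distinct tag names, then collect the matching children per tag.
sorted_tags = sorted({item.tag for item in root})
reordered = [node for tag in sorted_tags for node in root.findall(tag)]

# Write the new order back into the tree.
root[:] = reordered
print([(node.tag, node.text) for node in root])
```

Children that share a tag keep their original relative order, because findall returns matches in document order.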

Solution 3

Googling for an XML sorter, I got here. Based on the work of @MattH I made a more complete and adjustable function:

#!python3
from lxml import etree
import sys
if len(sys.argv) < 3:
    print("usage : xml_sorted.py file_in.xml file_out.xml")
    sys.exit(1)  # non-zero exit status signals the usage error

filename_in=sys.argv[1]
filename_out=sys.argv[2]

def getSortValue(elem):
    if isinstance(elem,etree._Comment):
        # sort comment by its content
        return elem.text
    else:
        # sort entities by tag and then by name
        return elem.tag + elem.attrib.get('name','')

doc=etree.parse(filename_in)

for parent in doc.xpath('//*[./*]'):  # every element that has child elements
    parent[:] = sorted(parent, key=getSortValue)

with open(filename_out,"wb") as file:
    file.write(etree.tostring(doc,pretty_print=True))

Author: George

Updated on June 06, 2022

Comments

  • George
    George almost 2 years

    I have a legacy file format which I'm converting into XML for processing. The structure can be summarised as:

    <A>
        <A01>X</A01>
        <A02>Y</A02>
        <A03>Z</A03>
    </A>
    

    The numerical part of the tags can go from 01 to 99 and there may be gaps. As part of the processing certain records may have additional tags added. After the processing is completed I'm converting the file back to the legacy format by iterwalking the tree. The files are reasonably large (~150,000 nodes).

    A problem with this is that some software which uses the legacy format assumes that the tags (or rather fields by the time it's converted) will be in alpha-numeric order but by default new tags will be added to the end of the branch which then causes them to come out of the iterator in the wrong order.

    I can use XPath to find the preceding sibling based on tag name each time I come to add a new tag, but my question is whether there's a simpler way to sort the whole tree at once just prior to export?

    Edit:

    I think I've over summarised the structure.

    A record can contain several levels as described above to give something like:

    <X>
        <X01>1</X01>
        <X02>2</X02>
        <X03>3</X03>
        <A>
            <A01>X</A01>
            <A02>Y</A02>
            <A03>Z</A03>
        </A>
        <B>
            <B01>Z</B01>
            <B02>X</B02>
            <B03>C</B03>
        </B>
    </X>
    
  • George
    George over 12 years
    Thanks - the lambda function does just what I need.
  • Sdwdaw
    Sdwdaw over 8 years
    I don't understand why you use parent[:] = in assignment.
  • MattH
    MattH over 8 years
    @Sdwdaw because I want to modify the content of the existing sequence-like object, parent. Not create a new list of that name.
  • D.Vanhaute
    D.Vanhaute about 3 years
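    If you get "TypeError: '<' not supported between instances of 'cython_function_or_method' and 'str'", add remove_comments=True to the XMLParser call. The error arises because a comment node's tag is a function rather than a string, so it cannot be compared with ordinary tag names.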