lxml (or lxml.html): print tree structure

One option is to run some XSLT over the source XML to strip everything but the tags. It is then easy to use etree.tostring to get a string you can hash:

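from lxml import etree as ET

def pp(e):
    # Pretty-print an element; encoding="unicode" makes tostring return str rather than bytes.
    print(ET.tostring(e, pretty_print=True, encoding="unicode"))
    print()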

root = ET.XML("""\
<project id="8dce5d94-4273-47ef-8d1b-0c7882f91caa" kpf_version="4">
<livefolder id="8744bc67-1b9e-443d-ba9f-96e1d0007ba8" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8">Mooo</livefolder>
<livefolder id="8744bc67-1b9e-443d-ba9f" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8" />
<preference-set idref="8dce5d94-4273-47ef-8d1b-0c7882f91caa">
  <boolean id="import_live">0</boolean>
</preference-set>
</project>
""")
pp(root)


xslt = ET.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
""")
tr = ET.XSLT(xslt)

doc2 = tr(root)
root2 = doc2.getroot()
pp(root2)

This prints the original tree, followed by the stripped tree:

<project id="8dce5d94-4273-47ef-8d1b-0c7882f91caa" kpf_version="4">
  <livefolder id="8744bc67-1b9e-443d-ba9f-96e1d0007ba8" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8">Mooo</livefolder>
  <livefolder id="8744bc67-1b9e-443d-ba9f" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8"/>
  <preference-set idref="8dce5d94-4273-47ef-8d1b-0c7882f91caa">
    <boolean id="import_live">0</boolean>
  </preference-set>
</project>

<project>
  <livefolder/>
  <livefolder/>
  <preference-set>
    <boolean/>
  </preference-set>
</project>
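
The hashing step is not shown above, so here is a minimal sketch of how it might look. The helper name structure_hash and the choice of SHA-256 are assumptions made for illustration, not part of the original answer; the stylesheet is the same one used above.

```python
import hashlib
from lxml import etree

# Same stylesheet as above: copy every element, drop attributes and text.
STRIP_XSLT = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
""")
strip = etree.XSLT(STRIP_XSLT)

def structure_hash(root):
    """Hypothetical helper: hash only the tag structure of an element tree."""
    skeleton = strip(root).getroot()
    # Without pretty_print, tostring returns compact bytes that can be hashed directly.
    data = etree.tostring(skeleton)
    return hashlib.sha256(data).hexdigest()

a = etree.XML('<a x="1"><b>text</b><c/></a>')
b = etree.XML('<a y="2"><b/><c>other</c></a>')
c = etree.XML('<a><c/><b/></a>')

print(structure_hash(a) == structure_hash(b))  # same shape, different attributes and text
print(structure_hash(a) == structure_hash(c))  # children in a different order
```

Two documents that differ only in attributes or text produce the same digest, while a change in tag names, nesting, or child order produces a different one.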

Author by

lajarre

Hacking around

Updated on August 22, 2022

Comments

  • lajarre
    lajarre over 1 year

    I'd like to print out the tree structure of an etree (built from an HTML document) in a distinguishable way, meaning that two etrees with different structures should print out differently.

    What I mean by structure is the "shape" of the tree, which basically means all the tags, but no attributes and no text content.

    Any idea? Is there something in lxml to do that?

    If not, I guess I have to iterate through the whole tree and construct a string from that. Any idea how to represent the tree in a compact way? (Compactness matters less.)

    FYI, the output is not intended to be looked at; it will be stored and hashed so that I can tell several HTML templates apart.

    Thanks

    • kindall
      kindall over 11 years
      Is there something that the .tostring() method isn't doing for you?
    • Fred Foo
      Fred Foo over 11 years
      I don't think LXML has this functionality built-in, so you'll have to walk the tree.
  • lajarre
    lajarre over 11 years
    Exactly: I didn't know much about XSLT, and it seems to be the right, standard way of doing what I want.
  • spiralx
    spiralx over 11 years
    Once you get into the habit of using it, it's really useful for anything where you start with lots of structure and want to turn it into something more manageable. Just remember the default rules are the same as this stylesheet - pastebin.com/b3WHMjPx - so it copies elements and attributes, but nothing else.
  • spiralx
    spiralx over 11 years
    This place has a very good tutorial, and even better reference material for all things XML: zvon.org/comp/m/tutorial.html
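
For comparison, the "walk the tree" alternative raised in the question and comments could look roughly like the sketch below. The function name shape and the tag(child,child) notation are assumptions chosen for illustration; nothing in the thread prescribes a particular format.

```python
from lxml import etree

def shape(element):
    """Hypothetical helper: return a compact string such as 'a(b,c(d))'."""
    # Comments and processing instructions have a non-string .tag in lxml, so skip them.
    children = [shape(child) for child in element if isinstance(child.tag, str)]
    if children:
        return "%s(%s)" % (element.tag, ",".join(children))
    return element.tag

root = etree.XML(
    '<project id="1"><livefolder>Mooo</livefolder><livefolder/>'
    '<preference-set><boolean>0</boolean></preference-set></project>'
)
print(shape(root))  # project(livefolder,livefolder,preference-set(boolean))
```

The resulting string can be stored or hashed in the same way as the XSLT output. Because the function is recursive, very deeply nested documents could hit Python's recursion limit.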