How to parse all children and grandchildren elements from a current node using xml.etree.ElementTree in python


element.findall('./') matches only the direct children of an element. To match all descendants, use .// (double slash) instead.
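A minimal sketch of the difference, written for Python 3 (the session shown in this answer is Python 2). The tiny document and the variable names are made up for illustration, and the wildcard is spelled out as `./*` and `.//*` rather than relying on the trailing-slash shorthand:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <p> is a child of the root, <i> is a grandchild.
root = ET.fromstring("<description><p>text <i>nested</i> tail</p></description>")

# './*' matches direct children only.
children = [el.tag for el in root.findall("./*")]

# './/*' matches every descendant, in document order.
descendants = [el.tag for el in root.findall(".//*")]

# Element.iter() walks the same subtree but also yields the element itself.
with_self = [el.tag for el in root.iter()]

print(children)      # ['p']
print(descendants)   # ['p', 'i']
print(with_self)     # ['description', 'p', 'i']
```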

A simplified version of your loop, run against the sample given, then results in:

>>> for element in root:
...     print 'parent: ', element.tag, '|', element.attrib
...     print element.text
...     for all_tags in element.findall('.//'):
...         print 'child: ', all_tags.tag, '|', all_tags.attrib
...         if all_tags.text:
...             print all_tags.text, '|', all_tags.tail
... 
parent:  p | {'num': 'p-0003'}

Protein kinases are involved in the signal transduction pathways linking growth factors, hormones and other cell regulation molecules to cell growth, survival and metabolism under both normal and pathological conditions. One such protein kinase, protein kinase B (also known as Akt), is a serine/threonine kinase that plays a central role in promoting the proliferation and survival of a wide range of cell types, thereby protecting cells from apoptosis (programmed cell death) (Khwaja,

child:  i | {}
Nature | 
33-34 (1990)). Three members of the Akt/PKB subfamily of second-messenger regulated serine/threonine protein kinases have been identified and are termed Akt1/PKBα, Akt2/PKBβ, and Akt3/PKBγ. A number of proteins involved in cell proliferation and survival have been described as substrates of Akt in cells. Two examples of such substrates include glycogen synthase kinase-3 (GSK3) and Forkhead transcription factors (FKs). See Brazil and Hemmings,

child:  i | {}
Trends in Biochemical Sciences | 
26, 675-664.

parent:  p | {'num': 'p-0004'}

A number of protein kinases and phosphatases regulate the activity of Akt; For instance, activation of Akt is mediated by phosphatidylinositol 3-kinase (PI3-K), which initiates the binding of second messenger phospholipids to the pleckstrin homology (PH) binding domain of Akt. The binding anchors Akt to plasma membrane and results in phosphorylation and activation of the enzyme. Amplifications of the catalytic subunit of PI3-K, p110α, or mutations in the PI3-K regulatory subunit, p85α, lead to activation of Akt in several types of human cancer. (Vivanco and Sawyers,

child:  i | {}
Nature Reviews in Cancer | 
(2002) 2: 489-501.

parent:  p | {'num': 'p-0005'}

The tumor suppressor, PTEN, is a critical negative regulator of Akt activation by PI3-K. Myers et al.

Or, using repr() to show the string literals instead:

parent:  p | {'num': 'p-0003'}
'\nProtein kinases are involved in the signal transduction pathways linking growth factors, hormones and other cell regulation molecules to cell growth, survival and metabolism under both normal and pathological conditions. One such protein kinase, protein kinase B (also known as Akt), is a serine/threonine kinase that plays a central role in promoting the proliferation and survival of a wide range of cell types, thereby protecting cells from apoptosis (programmed cell death) (Khwaja,\n'
child:  i | {}
'Nature' | u'\n33-34 (1990)). Three members of the Akt/PKB subfamily of second-messenger regulated serine/threonine protein kinases have been identified and are termed Akt1/PKB\u03b1, Akt2/PKB\u03b2, and Akt3/PKB\u03b3. A number of proteins involved in cell proliferation and survival have been described as substrates of Akt in cells. Two examples of such substrates include glycogen synthase kinase-3 (GSK3) and Forkhead transcription factors (FKs). See Brazil and Hemmings,\n'
child:  i | {}
'Trends in Biochemical Sciences' | '\n26, 675-664.\n'
parent:  p | {'num': 'p-0004'}
u'\nA number of protein kinases and phosphatases regulate the activity of Akt; For instance, activation of Akt is mediated by phosphatidylinositol 3-kinase (PI3-K), which initiates the binding of second messenger phospholipids to the pleckstrin homology (PH) binding domain of Akt. The binding anchors Akt to plasma membrane and results in phosphorylation and activation of the enzyme. Amplifications of the catalytic subunit of PI3-K, p110\u03b1, or mutations in the PI3-K regulatory subunit, p85\u03b1, lead to activation of Akt in several types of human cancer. (Vivanco and Sawyers,\n'
child:  i | {}
'Nature Reviews in Cancer' | '\n(2002) 2: 489-501.\n'
parent:  p | {'num': 'p-0005'}
'\nThe tumor suppressor, PTEN, is a critical negative regulator of Akt activation by PI3-K. Myers et al.\n'
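
If the goal is simply to collect all the text under one tag of interest, a shorter route may be `Element.itertext()`, which yields every `.text` and `.tail` inside an element in document order. The following is a Python 3 sketch, not the code used above. It assumes an abbreviated stand-in for the sample, wrapped in a hypothetical `<doc>` root so that `<description>` is not the root element:

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the sample; <doc> is a hypothetical wrapper element.
xml_source = """<doc><description>
<p num="p-0003">Protein kinases (Khwaja, <i>Nature</i> 33-34 (1990)). See Brazil and Hemmings, <i>Trends in Biochemical Sciences</i> 26, 675-664.</p>
<p num="p-0005">The tumor suppressor, PTEN.</p>
</description></doc>"""

root = ET.fromstring(xml_source)

# Locate the tag of interest anywhere below the root.
description = root.find(".//description")

paragraphs = []
for p in description.findall("p"):
    # itertext() yields p.text, then each descendant's text and tail, in order.
    text = "".join(p.itertext())
    # Collapse runs of whitespace left over from the XML layout.
    paragraphs.append(" ".join(text.split()))

for line in paragraphs:
    print(line)
```

Because the tails are picked up in order, the text following each `<i>` element stays in its original position within the paragraph.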

Author: user89423

Updated on June 04, 2022

Comments

  • user89423
    user89423 almost 2 years

    I am extracting all the text in an XML document. I want to look for a tag, say description, then search through all of its children and grandchildren (there could be even deeper elements) and extract the text.

    Here is my code but it is not able to get text inside the grandchildren tags:

    for element in root.find('description'):
        print 'parent: ', element.tag, '|', element.attrib
        try:
            data.write(element.text)
            for all_tags in element.findall('./'):
                print 'child: ', all_tags.tag, '|', all_tags.attrib
                if all_tags.text:
                    data.write('\n')
                    data.write(all_tags.text)
                    if all_tags.tail:
                        data.write('\n')
                        data.write(all_tags.tail)
                        data.write('\n')
            data.write('\n')
        except TypeError:
            pass
        except UnicodeEncodeError:
            unicodestr = element.text.encode("utf-8")
            data.write(unicodestr)
    
        data.write('\n')
    

    The problem is in the for all_tags loop.

    sample input:

    <description>
    <p num="p-0003">
    Protein kinases are involved in the signal transduction pathways linking growth factors, hormones and other cell regulation molecules to cell growth, survival and metabolism under both normal and pathological conditions. One such protein kinase, protein kinase B (also known as Akt), is a serine/threonine kinase that plays a central role in promoting the proliferation and survival of a wide range of cell types, thereby protecting cells from apoptosis (programmed cell death) (Khwaja,
    <i>Nature</i>
    33-34 (1990)). Three members of the Akt/PKB subfamily of second-messenger regulated serine/threonine protein kinases have been identified and are termed Akt1/PKBα, Akt2/PKBβ, and Akt3/PKBγ. A number of proteins involved in cell proliferation and survival have been described as substrates of Akt in cells. Two examples of such substrates include glycogen synthase kinase-3 (GSK3) and Forkhead transcription factors (FKs). See Brazil and Hemmings,
    <i>Trends in Biochemical Sciences</i>
    26, 675-664.
    </p>
    <p num="p-0004">
    A number of protein kinases and phosphatases regulate the activity of Akt; For instance, activation of Akt is mediated by phosphatidylinositol 3-kinase (PI3-K), which initiates the binding of second messenger phospholipids to the pleckstrin homology (PH) binding domain of Akt. The binding anchors Akt to plasma membrane and results in phosphorylation and activation of the enzyme. Amplifications of the catalytic subunit of PI3-K, p110α, or mutations in the PI3-K regulatory subunit, p85α, lead to activation of Akt in several types of human cancer. (Vivanco and Sawyers,
    <i>Nature Reviews in Cancer</i>
    (2002) 2: 489-501.
    </p>
    <p num="p-0005">
    The tumor suppressor, PTEN, is a critical negative regulator of Akt activation by PI3-K. Myers et al.
    </p>
    </description>
    

    In this input, the text after <i>Nature</i> is missed and replaced by the text in the first line. I believe this is because all_tags.tail is getting text from the parent tag, not from the children and grandchildren tags.

  • user89423
    user89423 almost 11 years
    Hi Martijn, I did that too, and I tried both cases, but I am still not able to get the text within children/grandchildren tags using the tail attribute.
  • Martijn Pieters
    Martijn Pieters almost 11 years
    .xpath() is a lxml extension, the OP seems to be using standard library ElementTree instead.
  • user89423
    user89423 almost 11 years
    Right, I am using ElementTree.
  • synthesizerpatel
    synthesizerpatel almost 11 years
    Shouldn't be a problem - elementtree supports xpath. docs.python.org/2/library/…
  • user89423
    user89423 almost 11 years
    Does that mean you modified my first line? Like, I said I want to get one tag of interest---this tag might not be the root and then do the extraction inside. Thanks a lot.
  • Martijn Pieters
    Martijn Pieters almost 11 years
    @user89423: In my setup root is the <description> tag, because you gave us a more limited sample (I had to add in the tag even). Ignore that first line, that's why I stated it is simplified.
  • mzjn
    mzjn almost 11 years
    @synthesizerpatel: ElementTree supports a subset of XPath 1.0, not all of it.
  • user89423
    user89423 almost 11 years
    Thanks, that was very helpful.
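
On the `.tail` question raised in the original post: in ElementTree, an element's `.tail` is the text that follows its own closing tag, up to the next tag. That text sits inside the parent element in the XML source, but it is stored on the child. A small Python 3 sketch with a made-up one-line paragraph:

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph with one inline child element.
p = ET.fromstring("<p>before <i>inside</i> after</p>")
i = p.find("i")

print(repr(p.text))  # 'before '  (text before the first child)
print(repr(i.text))  # 'inside'   (text inside <i>)
print(repr(i.tail))  # ' after'   (text after </i>, stored on <i>)
print(repr(p.tail))  # None       (nothing follows </p>)
```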