Parse XML with lxml - extract element value

Solution 1

I would be more direct in your XPath: go straight for the elements you want, in this case datafield.

for df in doc.xpath('//datafield'):
    # Iterate over the attributes of datafield
    for attrib_name in df.attrib:
        print('@' + attrib_name + '=' + df.attrib[attrib_name])

    # subfield elements are the children of datafield; iterate over them
    for subfield in df:
        print('subfield=' + subfield.text)

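Also, you can leave the namespace out of this XPath. The inner <record xmlns=""> resets the default namespace, so datafield and subfield are in no namespace at all, and an unprefixed name in XPath matches exactly those elements.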
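
To see which elements are namespaced and which are not, here is a minimal sketch. It assumes a trimmed, inline copy of the XML from the question (one record, no schema attributes) rather than the original file, and the variable names are only illustrative.

```python
from lxml import etree

# Trimmed copy of the response from the question (one record only).
XML = b"""<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <records>
    <record>
      <recordData>
        <record xmlns="">
          <datafield tag="000">
            <subfield code="a">123</subfield>
            <subfield code="b">456</subfield>
          </datafield>
        </record>
      </recordData>
    </record>
  </records>
</searchRetrieveResponse>"""

root = etree.fromstring(XML)

# Outer elements inherit the default namespace, so lxml reports a qualified tag.
outer_tag = root.tag

# Elements below <record xmlns=""> are in no namespace: plain tag names.
inner_tags = [df.tag for df in root.xpath('//datafield')]

# An unprefixed XPath name only matches no-namespace elements,
# so this finds the inner <record> but not the outer one.
plain_records = root.xpath('//record')

print(outer_tag)
print(inner_tags)
print(len(plain_records))
```

The outer record is only reachable through a prefix mapped to http://www.loc.gov/zing/srw/, which is why the original query needed the namespaces argument.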

Solution 2

I would just go with

for df in doc.xpath('//datafield'):
    print(df.attrib)
    for sf in df:
        print(sf.text)

Also, you don't need urllib: etree.parse() can read from an HTTP URL directly.

url = "http://dl.dropbox.com/u/540963/short_test.xml"  # doesn't work with https, though
doc = etree.parse(url)
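
If the file is only available over https, one possible workaround is to download the bytes yourself and hand them to lxml. This is a rough sketch, not the only way to do it: fetch_and_parse and parse_bytes are hypothetical helper names, the timeout value is arbitrary, and the inline sample stands in for the real file.

```python
from urllib.request import urlopen

from lxml import etree


def parse_bytes(data):
    """Parse an XML document that is already in memory as bytes."""
    return etree.fromstring(data)


def fetch_and_parse(url, timeout=10):
    """Download url with urllib (which handles https) and parse the body."""
    with urlopen(url, timeout=timeout) as response:
        return parse_bytes(response.read())


# Offline check of the parsing half, using a tiny inline document.
sample = (b'<record><datafield tag="000">'
          b'<subfield code="a">123</subfield>'
          b'</datafield></record>')
root = parse_bytes(sample)
tags = root.xpath('//datafield/@tag')
print(tags)
```

Note that etree.fromstring() returns the root element rather than an ElementTree; xpath() works on both.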

Solution 3

Try the following working code:

from urllib.request import urlopen
from lxml import etree

url = "https://dl.dropbox.com/u/540963/short_test.xml"
fp = urlopen(url)
doc = etree.parse(fp)
fp.close()

for record in doc.xpath('//datafield'):
    print(record.xpath("./@tag")[0])
    for x in record.xpath("./subfield/text()"):
        print("\t", x)
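
If you want the values in a data structure instead of printed, the same relative XPath idea can be wrapped in a function. A minimal sketch, assuming an inline copy of one inner record from the question; extract_datafields is a hypothetical helper name, not part of lxml.

```python
from lxml import etree

# One inner record from the question, inlined so the example runs offline.
XML = b"""<record>
  <datafield tag="000">
    <subfield code="a">123</subfield>
    <subfield code="b">456</subfield>
  </datafield>
  <datafield tag="001">
    <subfield code="a">789</subfield>
    <subfield code="b">987</subfield>
  </datafield>
</record>"""


def extract_datafields(root):
    """Return one (tag, {code: text}) pair per datafield element."""
    result = []
    for datafield in root.xpath('//datafield'):
        # Relative path: only the subfield children of this datafield.
        subfields = {sf.get('code'): sf.text
                     for sf in datafield.xpath('./subfield')}
        result.append((datafield.get('tag'), subfields))
    return result


fields = extract_datafields(etree.fromstring(XML))
for tag, subfields in fields:
    print(tag, subfields)
```

A dict keyed by code assumes each code appears at most once per datafield; if codes can repeat, a list of (code, text) pairs would be safer.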
Author by

Andrej

Updated on July 09, 2022

Comments

  • Andrej
    Andrej almost 2 years

    Let's suppose we have the XML file with the structure as follows.

    <?xml version="1.0" ?> 
    <searchRetrieveResponse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/zing/srw/ http://www.loc.gov/standards/sru/sru1-1archive/xml-files/srw-types.xsd" xmlns="http://www.loc.gov/zing/srw/">
      <records xmlns:ns1="http://www.loc.gov/zing/srw/">
        <record>
          <recordData>
            <record xmlns="">
              <datafield tag="000">
                <subfield code="a">123</subfield>
                <subfield code="b">456</subfield>
              </datafield>
              <datafield tag="001">
                <subfield code="a">789</subfield>
                <subfield code="b">987</subfield>
              </datafield>
            </record>
          </recordData>
        </record>
        <record>
          <recordData>
            <record xmlns="">
              <datafield tag="000">
                <subfield code="a">123</subfield>
                <subfield code="b">456</subfield>
              </datafield>
              <datafield tag="001">
                <subfield code="a">789</subfield>
                <subfield code="b">987</subfield>
              </datafield>
            </record>
          </recordData>
        </record>
      </records>
    </searchRetrieveResponse>
    

    I need to parse out:

    • The content of the "subfield" (e.g. 123 in the example above) and
    • Attribute values (e.g. 000 or 001)

    I wonder how to do that using lxml and XPath. My initial code is pasted below; could someone explain to me how to parse out the values?

    from urllib.request import urlopen
    from lxml import etree

    url = "https://dl.dropbox.com/u/540963/short_test.xml"
    fp = urlopen(url)
    doc = etree.parse(fp)
    fp.close()

    ns = {'xsi':'http://www.loc.gov/zing/srw/'}

    for record in doc.xpath('//xsi:record', namespaces=ns):
        print(record.xpath("xsi:recordData/record/datafield[@tag='000']", namespaces=ns))
    
    • MiMo
      MiMo over 11 years
      You are using the prefix xsi for the namespace http://www.loc.gov/zing/srw/ - this is valid, but usually xsi is used as a prefix for the standard namespace http://www.w3.org/2001/XMLSchema-instance.
  • Andrej
    Andrej over 11 years
    Yes, I have only one namespace.
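
Following up on the prefix comment: the prefix in the namespaces dict is just a local alias, so any name works as long as the URI matches the document. Here is a sketch of the original query with a hypothetical srw prefix, extended to return the subfield text; it assumes an inline, trimmed copy of the XML from the question.

```python
from lxml import etree

XML = b"""<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <records>
    <record>
      <recordData>
        <record xmlns="">
          <datafield tag="000">
            <subfield code="a">123</subfield>
            <subfield code="b">456</subfield>
          </datafield>
          <datafield tag="001">
            <subfield code="a">789</subfield>
            <subfield code="b">987</subfield>
          </datafield>
        </record>
      </recordData>
    </record>
  </records>
</searchRetrieveResponse>"""

root = etree.fromstring(XML)

# The prefix is a local alias chosen here; only the URI has to match.
ns = {'srw': 'http://www.loc.gov/zing/srw/'}

values = []
for record in root.xpath('//srw:record', namespaces=ns):
    # Prefixed steps for namespaced elements, bare names for the
    # no-namespace elements below <record xmlns="">.
    path = "srw:recordData/record/datafield[@tag='000']/subfield/text()"
    values.extend(record.xpath(path, namespaces=ns))

print(values)
```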