How to get results from the XML SAX parser in Python

Solution 1

To get the content of an element, you need to override the characters method. Add this to your handler class:

def characters(self, data):
    print(data)

Be careful with this, though: the parser is not required to give you all the data in a single chunk, so characters may be called several times for one text node. You should use an internal buffer and read it when needed. In most of my xml/sax code I do something like this:

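import xml.sax.handler

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._charBuffer = []

    def _flushCharBuffer(self):
        # Join the collected chunks and reset the buffer for the next element.
        s = ''.join(self._charBuffer)
        self._charBuffer = []
        return s

    def characters(self, data):
        self._charBuffer.append(data)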

... and then call the flush method at the end of the elements where I need the data.
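As a rough illustration of that pattern, the sketch below collects the text of every title element. The class name TitleHandler and the SAMPLE document are made up for this example, and it assumes Python 3. The entity in the sample is there because parsers often deliver such text in several pieces, although how many pieces you get is up to the parser.

```python
import xml.sax
import xml.sax.handler


class TitleHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler that collects the text of each <title> element."""

    def __init__(self):
        super().__init__()
        self._char_buffer = []
        self.titles = []

    def startElement(self, name, attrs):
        # Drop anything (usually whitespace) collected before this tag.
        self._char_buffer = []

    def characters(self, data):
        # May be called more than once for a single text node.
        self._char_buffer.append(data)

    def endElement(self, name):
        # Flush the buffer at the end of every element.
        text = "".join(self._char_buffer)
        self._char_buffer = []
        if name == "title":
            self.titles.append(text)


# Made-up sample input, not the asker's file.
SAMPLE = b"<job><title>Nurse &amp; Educator</title><job-code>881723</job-code></job>"

handler = TitleHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.titles)
```

Joining the buffer only in endElement means the result is the same however the parser splits the text.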

For your whole use case, assuming you have a file containing multiple job descriptions and want a list that holds the jobs, with each job being a dictionary of its fields, do something like this:

import xml.sax
import xml.sax.handler

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._charBuffer = []
        self._result = []

    def _getCharacterData(self):
        data = ''.join(self._charBuffer)
        self._charBuffer = []
        return data.strip()  # remove strip() if whitespace is important

    def parse(self, f):
        xml.sax.parse(f, self)
        return self._result

    def characters(self, data):
        self._charBuffer.append(data)

    def startElement(self, name, attrs):
        # Each <job> starts a new dictionary in the result list.
        if name == 'job':
            self._result.append({})

    def endElement(self, name):
        # Every element except <job> itself becomes a key of the current job.
        if name != 'job':
            self._result[-1][name] = self._getCharacterData()

jobs = MyHandler().parse("job-file.xml")  # a list of all jobs

If you just need to parse a single job at a time, you can simplify the list part and throw away the startElement method - just set _result to a dict and assign to it directly in endElement.
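A minimal sketch of that single-job variant could look like the following. The class name SingleJobHandler and the SAMPLE document are assumptions made for illustration, and the code targets Python 3.

```python
import xml.sax
import xml.sax.handler


class SingleJobHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler for a document that holds exactly one <job>."""

    def __init__(self):
        super().__init__()
        self._char_buffer = []
        self.result = {}

    def characters(self, data):
        self._char_buffer.append(data)

    def endElement(self, name):
        # Join and reset the buffer, then store the text under the tag name.
        text = "".join(self._char_buffer).strip()
        self._char_buffer = []
        if name != "job":
            self.result[name] = text


# Made-up, shortened sample based on the structure in the question.
SAMPLE = (
    b"<job>"
    b"<title>Registered Nurse-Epilepsy</title>"
    b"<job-code>881723</job-code>"
    b"<location><city>Dallas</city></location>"
    b"</job>"
)

handler = SingleJobHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.result)
```

Note that container elements such as location end up in the dictionary with an empty string, because their own text is only whitespace.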

Solution 2

To get the text content of a node, you need to implement a characters method. E.g.

import xml.sax.handler

class Exact(xml.sax.handler.ContentHandler):
  def __init__(self):
    xml.sax.handler.ContentHandler.__init__(self)
    self.curpath = []

  def startElement(self, name, attrs):
    print(name, attrs)

  def endElement(self, name):
    print('end ' + name)

  def characters(self, content):
    print(content)

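Would output something like this (the exact object representation and memory addresses vary by Python version and run):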

job <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9baec>



title <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb0c>
Registered Nurse-Epilepsy
end title



job-code <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb2c>
881723
end job-code



detail-url <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb2c>
http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance



end detail-url

(snipped)

Solution 3

You need to implement a characters handler too:

def characters(self, content):
    print(content)

but this potentially gives you text in chunks instead of as one block per tag.

Do yourself a big favour though and use the ElementTree API instead; that API is far more Pythonic and easier to use than the XML DOM API.

from xml.etree import ElementTree as ET

etree = ET.parse('/path/to/xml_file.xml')
jobtitle = etree.find('job/title').text

If all you want is a straight conversion to a dictionary, take a look at this handy ActiveState Python Cookbook recipe: Converting XML to dictionary and back. Note that it uses the ElementTree API as well.
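If you only need something very basic, a recursive conversion is short enough to write by hand. The sketch below is not the cookbook recipe; element_to_dict and SAMPLE are hypothetical names, and the function makes simplifying assumptions: attributes are ignored, repeated sibling tags overwrite each other, and text mixed in between child elements is dropped.

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an element to nested dicts; leaf elements become stripped text."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    # Assumption: sibling tag names are unique, so later duplicates would win.
    return {child.tag: element_to_dict(child) for child in children}


# Made-up, shortened sample based on the structure in the question.
SAMPLE = (
    "<job>"
    "<title>Registered Nurse-Epilepsy</title>"
    "<location><city>Dallas</city><state>TX</state></location>"
    "</job>"
)

job = element_to_dict(ET.fromstring(SAMPLE))
print(job)
```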

If you have a set of existing elements you want to look for, just use these in the find() method:

fieldnames = [
    'title', 'job-code', 'detail-url', 'job-category', 'description',
    'summary', 'posted-date', 'location', 'address', 'city', 'state',
    'zip', 'country', 'company', 'name', 'url']
fields = {}

etree = ET.parse('/path/to/xml_file.xml')

for field in fieldnames:
    # find() with a bare tag name only matches direct children of the root element
    elem = etree.find(field)
    if elem is not None and elem.text is not None:
        fields[field] = elem.text
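Some of the fields in the question's XML are nested (city sits inside location, for example), so a lookup by bare tag name from the root will not reach them. One way around that, sketched here with a made-up SAMPLE string and a shortened field list, is to prefix the tag with './/' so the search covers all descendants:

```python
import xml.etree.ElementTree as ET

# Made-up, shortened sample based on the structure in the question.
SAMPLE = (
    "<job>"
    "<title>Registered Nurse-Epilepsy</title>"
    "<posted-date>2012-07-26</posted-date>"
    "<location><city>Dallas</city><zip>75230</zip></location>"
    "</job>"
)

fieldnames = ["title", "posted-date", "city", "zip", "country"]
root = ET.fromstring(SAMPLE)

fields = {}
for field in fieldnames:
    # './/tag' searches every descendant; findtext() returns None if nothing matches.
    text = root.findtext(".//" + field)
    if text is not None:
        fields[field] = text.strip()

print(fields)
```

Fields that do not occur in the document (country in this sample) are simply left out of the dictionary.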

Solution 4

I would recommend using pulldom (xml.dom.pulldom). It lets you load a document with a SAX parser and, when you find a node you are interested in, load just that node into a DOM fragment.

Here is an article on using it with some examples: https://www.ibm.com/developerworks/xml/library/x-tipulldom/index.html
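A small sketch of that approach, assuming Python 3 and a made-up SAMPLE document with a jobs wrapper element: the stream is read event by event, and only each job element is expanded into a DOM subtree.

```python
from xml.dom import pulldom

# Made-up sample; the <jobs> wrapper is an assumption for illustration.
SAMPLE = (
    "<jobs>"
    "<job><title>Registered Nurse-Epilepsy</title></job>"
    "<job><title>Staff Nurse</title></job>"
    "</jobs>"
)

titles = []
events = pulldom.parseString(SAMPLE)
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "job":
        # Build a DOM subtree for this <job> only.
        events.expandNode(node)
        title = node.getElementsByTagName("title")[0]
        # Join all text children in case the text arrived in several nodes.
        titles.append(
            "".join(c.data for c in title.childNodes if c.nodeType == c.TEXT_NODE)
        )

print(titles)
```

The rest of the document is never turned into DOM nodes, which keeps memory use low for large files.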

Author by

Shiva Krishna Bavandla

Updated on July 09, 2022

Comments

  • Shiva Krishna Bavandla almost 2 years

    I am working with the xml.sax parser to parse XML files. My code is below.

    XML file:

    <job>
        <title>Registered Nurse-Epilepsy</title>
        <job-code>881723</job-code>
        <detail-url>http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance
        </detail-url>
        <job-category>Neuroscience Nursing</job-category>
        <description>
            <summary>
                <div class='descriptionheader'>Description</div><P STYLE="margin-top:0px;margin-bottom:0px"><SPAN STYLE="font-family:Arial;font-size:small">Utilizing the standards set forth for Nursing Practice by the ANA and ONS, the RN will organize, modify, evaluate, document and maintain the plan of care for Epilepsy and/or Neurological patients. It will include individualized, family centered, holistic, supportive, and safe age-specific care.</SPAN></P><div class='qualificationsheader'>Qualifications</div><UL STYLE="list-style-type:disc"> <LI>Graduate of an accredited school of Professional Nursing.</LI> <LI>BSN preferred </LI> <LI>Current licensure with the Board of Nurse Examiners for the State of Texas</LI> <LI>Experience in Epilepsy Monitoring and/or Neurological background preferred.</LI> <LI>ACLS preferred, within 6 months of hire</LI> <LI>PALS required upon hire</LI> </UL>
           </summary>
        </description>
        <posted-date>2012-07-26</posted-date>
        <location>
           <address>7777 Forest Lane</address>
           <city>Dallas</city>
           <state>TX</state>
           <zip>75230</zip>
           <country>US</country>
        </location>
        <company>
           <name>Medical City (Dallas, TX)</name>
          <url>http://www.hcanorthtexas.com/careers/search-jobs.dot</url>
        </company>
    </job> 
    

    Python code (partial, just enough to show my problem, up to the startElement function):

    from xml.sax.handler import ContentHandler
    import xml.sax
    import xml.parsers.expat
    import ConfigParser
    
    class Exact(xml.sax.handler.ContentHandler):
      def __init__(self):
        self.curpath = []
    
      def startElement(self, name, attrs):
        print name,attrs
        self.clearFields()
    
    
      def endElement(self, name):
        pass
    
      def characters(self, data):
        self.buffer += data
    
      def clearFields(self):
        self.fields = {}
        self.fields['title'] = None
        self.fields['job-code'] = None
        self.fields['detail-url'] = None
        self.fields['job-category'] = None
        self.fields['description'] = None
        self.fields['summary'] = None
        self.fields['posted-date'] = None
        self.fields['location'] = None
        self.fields['address'] = None
        self.fields['city'] = None
        self.fields['state'] = None
        self.fields['zip'] = None
        self.fields['country'] = None
        self.fields['company'] = None
        self.fields['name'] = None
        self.fields['url'] = None
        
        self.buffer = ''
          
    if __name__ == '__main__':
      parser = xml.sax.make_parser()
      handler = Exact()
      parser.setContentHandler(handler)
      parser.parse(open('/path/to/xml_file.xml'))
    

    Result: the output of the print statement above is shown below.

    job     <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    title   <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    job-code <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    detail-url <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    job-category <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    description  <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    summary       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    posted-date   <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    location      <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    address       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    city          <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    state         <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    zip           <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    country       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    company       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    name          <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    url           <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
    

    As you can see above, I get the name and attrs from the print statement, but what I actually want is the value of each element. How do I fetch the values for all of the tags above? Right now I am getting only the node names, not their values.

    Edited Code:

    I am really confused about how to map the data from the nodes to the keys in the dictionary shown above.