How to get results from an XML SAX parser in Python
Solution 1
To get the content of an element, you need to override the characters method. Add this to your handler class:
    def characters(self, data):
        print(data)
Be careful with this, though: the parser is not required to give you all the data in a single chunk. You should use an internal buffer and read it when needed. In most of my xml/sax code I do something like this:
import xml.sax.handler

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []

    def _flushCharBuffer(self):
        # join the collected chunks and reset the buffer
        s = ''.join(self._charBuffer)
        self._charBuffer = []
        return s

    def characters(self, data):
        self._charBuffer.append(data)
... and then call the flush method at the end of the elements where I need the data.
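As a rough illustration of where the flush call goes, here is a minimal Python 3 sketch. The TitleHandler class, its titles list, and the inline sample document are made up for this example; the buffer is also reset in startElement so that whitespace between tags does not leak into the next value.

```python
import xml.sax
import xml.sax.handler


class TitleHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler that collects the text of every <title> element."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._charBuffer = []
        self.titles = []

    def _flushCharBuffer(self):
        s = ''.join(self._charBuffer)
        self._charBuffer = []
        return s

    def startElement(self, name, attrs):
        # Drop any whitespace collected since the previous tag.
        self._charBuffer = []

    def characters(self, data):
        # May be called several times for one element (e.g. around entities).
        self._charBuffer.append(data)

    def endElement(self, name):
        text = self._flushCharBuffer()
        if name == 'title':
            self.titles.append(text)


handler = TitleHandler()
# Made-up sample input; the entity makes chunked delivery likely.
xml.sax.parseString(b"<jobs><job><title>Nurse &amp; Carer</title></job></jobs>", handler)
print(handler.titles)  # ['Nurse & Carer']
```

However the parser splits the text around the entity, the joined buffer gives the complete value.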
For your whole use case, assuming you have a file containing multiple job descriptions and want a list that holds the jobs, with each job being a dictionary of its fields, do something like this:
import xml.sax
import xml.sax.handler

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []
        self._result = []

    def _getCharacterData(self):
        data = ''.join(self._charBuffer).strip()  # remove strip() if whitespace is important
        self._charBuffer = []
        return data

    def parse(self, f):
        xml.sax.parse(f, self)
        return self._result

    def characters(self, data):
        self._charBuffer.append(data)

    def startElement(self, name, attrs):
        if name == 'job':
            self._result.append({})

    def endElement(self, name):
        if name != 'job':
            self._result[-1][name] = self._getCharacterData()

jobs = MyHandler().parse("job-file.xml")  # a list of all jobs
If you just need to parse a single job at a time, you can simplify the list part and throw away the startElement method: just set _result to a dict and assign to it directly in endElement.
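A sketch of that single-job variant might look like the following (Python 3). SingleJobHandler, parse_string, and the trimmed sample document are names and data invented for this example, not part of the original code.

```python
import xml.sax
import xml.sax.handler


class SingleJobHandler(xml.sax.handler.ContentHandler):
    """Hypothetical single-job variant: the result is one dict, not a list."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._charBuffer = []
        self._result = {}

    def _getCharacterData(self):
        data = ''.join(self._charBuffer).strip()
        self._charBuffer = []
        return data

    def characters(self, data):
        self._charBuffer.append(data)

    def endElement(self, name):
        # No startElement needed: every closing tag except <job> becomes a key.
        if name != 'job':
            self._result[name] = self._getCharacterData()

    def parse_string(self, xml_bytes):
        xml.sax.parseString(xml_bytes, self)
        return self._result


# Shortened sample based on the XML in the question.
SAMPLE = b"""<job>
  <title>Registered Nurse-Epilepsy</title>
  <job-code>881723</job-code>
  <location><city>Dallas</city><state>TX</state></location>
</job>"""

job = SingleJobHandler().parse_string(SAMPLE)
print(job['title'], job['job-code'], job['city'])  # Registered Nurse-Epilepsy 881723 Dallas
```

Note that container elements such as location end up in the dict with an empty string as their value, because their own text is only whitespace; filter them out if that matters.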
Solution 2
To get the text content of a node, you need to implement a characters method. E.g.
class Exact(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.curpath = []

    def startElement(self, name, attrs):
        print('%s %s' % (name, attrs))

    def endElement(self, name):
        print('end ' + name)

    def characters(self, content):
        print(content)
Would output:
job <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9baec>
title <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb0c>
Registered Nurse-Epilepsy
end title
job-code <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb2c>
881723
end job-code
detail-url <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb2c>
http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance
end detail-url
(snipped)
Solution 3
You need to implement a characters handler too:
    def characters(self, content):
        print(content)
but this potentially gives you text in chunks instead of as one block per tag.
Do yourself a big favour though and use the ElementTree API instead; that API is far more Pythonic and easier to use than the XML DOM API.
from xml.etree import ElementTree as ET
etree = ET.parse('/path/to/xml_file.xml')
jobtitle = etree.find('job/title').text  # path is relative to the root element; use 'title' if <job> itself is the root
If all you want is a straight conversion to a dictionary, take a look at this handy ActiveState Python Cookbook recipe: Converting XML to dictionary and back. Note that it uses the ElementTree API as well.
If you have a set of existing elements you want to look for, just use these in the find() method:
from xml.etree import ElementTree as ET

fieldnames = [
    'title', 'job-code', 'detail-url', 'job-category', 'description',
    'summary', 'posted-date', 'location', 'address', 'city', 'state',
    'zip', 'country', 'company', 'name', 'url']
fields = {}
etree = ET.parse('/path/to/xml_file.xml')
for field in fieldnames:
    # find() with a bare tag name only matches direct children of the root
    elem = etree.find(field)
    if elem is not None and elem.text is not None:
        fields[field] = elem.text
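One caveat: find() with a bare tag name only looks at direct children, so nested tags such as city or state (inside location) would be skipped. A possible workaround, sketched here in Python 3 on a shortened inline sample rather than the real file, is to prefix each name with './/' so that the first matching descendant is returned. This assumes the tag names are unique within a job.

```python
from xml.etree import ElementTree as ET

# Shortened sample based on the XML in the question.
SAMPLE = """<job>
  <title>Registered Nurse-Epilepsy</title>
  <location>
    <city>Dallas</city>
    <state>TX</state>
  </location>
</job>"""

root = ET.fromstring(SAMPLE)
fieldnames = ['title', 'city', 'state', 'zip']  # 'zip' is absent on purpose
fields = {}
for field in fieldnames:
    # './/' searches all descendants, not just direct children.
    elem = root.find('.//' + field)
    if elem is not None and elem.text is not None:
        fields[field] = elem.text.strip()

print(fields)  # {'title': 'Registered Nurse-Epilepsy', 'city': 'Dallas', 'state': 'TX'}
```

Missing fields are simply left out of the dictionary.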
Solution 4
I would recommend using pulldom. This lets you load a document with a SAX parser and, when you find a node you are interested in, load just that node into a DOM fragment.
Here is an article on using it with some examples: https://www.ibm.com/developerworks/xml/library/x-tipulldom/index.html
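As a rough sketch of the idea (Python 3, standard library xml.dom.pulldom), the loop below streams through the document and expands only the job nodes into DOM fragments. The inline sample, including the second job, is made up for illustration, and the dictionary-building step is just one way to use the expanded node.

```python
from xml.dom import pulldom

# Made-up sample: a root element wrapping two short job entries.
SAMPLE = """<jobs>
  <job><title>Registered Nurse-Epilepsy</title><job-code>881723</job-code></job>
  <job><title>Staff Pharmacist</title><job-code>990001</job-code></job>
</jobs>"""

jobs = []
events = pulldom.parseString(SAMPLE)
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == 'job':
        # Pull the rest of this <job> element into a small DOM fragment.
        events.expandNode(node)
        job = {}
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                # An element's text may be split over several text nodes.
                text = ''.join(t.data for t in child.childNodes
                               if t.nodeType == t.TEXT_NODE)
                job[child.tagName] = text.strip()
        jobs.append(job)

print(jobs)
```

Only one job is held in memory as a DOM fragment at a time, which is the main appeal over building a full DOM for a large file.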
Shiva Krishna Bavandla
Updated on July 09, 2022
I am working with an XML SAX parser to parse XML files; my code is below.
XML file:
<job>
  <title>Registered Nurse-Epilepsy</title>
  <job-code>881723</job-code>
  <detail-url>http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance </detail-url>
  <job-category>Neuroscience Nursing</job-category>
  <description>
    <summary>
      <div class='descriptionheader'>Description</div>
      <P STYLE="margin-top:0px;margin-bottom:0px"><SPAN STYLE="font-family:Arial;font-size:small">Utilizing the standards set forth for Nursing Practice by the ANA and ONS, the RN will organize, modify, evaluate, document and maintain the plan of care for Epilepsy and/or Neurological patients. It will include individualized, family centered, holistic, supportive, and safe age-specific care.</SPAN></P>
      <div class='qualificationsheader'>Qualifications</div>
      <UL STYLE="list-style-type:disc">
        <LI>Graduate of an accredited school of Professional Nursing.</LI>
        <LI>BSN preferred </LI>
        <LI>Current licensure with the Board of Nurse Examiners for the State of Texas</LI>
        <LI>Experience in Epilepsy Monitoring and/or Neurological background preferred.</LI>
        <LI>ACLS preferred, within 6 months of hire</LI>
        <LI>PALS required upon hire</LI>
      </UL>
    </summary>
  </description>
  <posted-date>2012-07-26</posted-date>
  <location>
    <address>7777 Forest Lane</address>
    <city>Dallas</city>
    <state>TX</state>
    <zip>75230</zip>
    <country>US</country>
  </location>
  <company>
    <name>Medical City (Dallas, TX)</name>
    <url>http://www.hcanorthtexas.com/careers/search-jobs.dot</url>
  </company>
</job>
Python code (partial, up to the startElement function, to illustrate my question):
from xml.sax.handler import ContentHandler
import xml.sax
import xml.parsers.expat
import ConfigParser

class Exact(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.curpath = []

    def startElement(self, name, attrs):
        print name,attrs
        self.clearFields()

    def endElement(self, name):
        pass

    def characters(self, data):
        self.buffer += data

    def clearFields():
        self.fields = {}
        self.fields['title'] = None
        self.fields['job-code'] = None
        self.fields['detail-url'] = None
        self.fields['job-category'] = None
        self.fields['description'] = None
        self.fields['summary'] = None
        self.fields['posted-date'] = None
        self.fields['location'] = None
        self.fields['address'] = None
        self.fields['city'] = None
        self.fields['state'] = None
        self.fields['zip'] = None
        self.fields['country'] = None
        self.fields['company'] = None
        self.fields['name'] = None
        self.fields['url'] = None
        self.buffer = ''

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    handler = Exact()
    parser.setContentHandler(handler)
    parser.parse(open('/path/to/xml_file.xml'))
Result: the output of the print statement above is given below.
job <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
title <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
job-code <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
detail-url <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
job-category <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
description <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
summary <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
posted-date <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
location <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
address <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
city <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
state <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
zip <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
country <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
company <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
name <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
url <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
As you can see above, I am getting name and attrs from the print statement, but what I actually want is the value of each tag. How do I fetch the values for all of the tags above? At the moment I am getting only the node names, not their values.
I am really confused about how to map the data from the nodes to the keys in the dictionary shown above.