How to parse XML using the SAX parser

Solution 1

So you want to build an XML parser to parse an RSS feed like this one.

<rss version="0.92">
<channel>
    <title>MyTitle</title>
    <link>http://myurl.com</link>
    <description>MyDescription</description>
    <lastBuildDate>SomeDate</lastBuildDate>
    <docs>http://someurl.com</docs>
    <language>SomeLanguage</language>

    <item>
        <title>TitleOne</title>
        <description><![CDATA[Some text.]]></description>
        <link>http://linktoarticle.com</link>
    </item>

    <item>
        <title>TitleTwo</title>
        <description><![CDATA[Some other text.]]></description>
        <link>http://linktoanotherarticle.com</link>
    </item>

</channel>
</rss>

Now you have two SAX implementations you can work with: either org.xml.sax or android.sax. I'm going to explain the pros and cons of both after posting a short example.

android.sax Implementation

Let's start with the android.sax implementation.

You first have to define the XML structure using the RootElement and Element objects.

In any case I would work with POJOs (Plain Old Java Objects) to hold your data. Here are the POJOs needed.

Channel.java

public class Channel implements Serializable {

    private Items items;
    private String title;
    private String link;
    private String description;
    private String lastBuildDate;
    private String docs;
    private String language;

    public Channel() {
        setItems(null);
        setTitle(null);
        // set every field to null in the constructor
    }

    public void setItems(Items items) {
        this.items = items;
    }

    public Items getItems() {
        return items;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getTitle() {
        return title;
    }
    // rest of the class looks similar so just setters and getters
}

This class implements the Serializable interface so you can put it into a Bundle and do something with it.

Now we need a class to hold our items. In this case I'm just going to extend the ArrayList class.

Items.java

public class Items extends ArrayList<Item> {

    public Items() {
        super();
    }

}

That's it for our items container. We now need a class to hold the data of every single item.

Item.java

public class Item implements Serializable {

    private String title;
    private String description;
    private String link;

    public Item() {
        setTitle(null);
        setDescription(null);
        setLink(null);
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getTitle() {
        return title;
    }

    // same as above.

}

Example:

public class Example extends DefaultHandler {

    private Channel channel;
    private Items items;
    private Item item;

    public Example() {
        items = new Items();
    }

    public Channel parse(InputStream is) {
        RootElement root = new RootElement("rss");
        Element chanElement = root.getChild("channel");
        Element chanTitle = chanElement.getChild("title");
        Element chanLink = chanElement.getChild("link");
        Element chanDescription = chanElement.getChild("description");
        Element chanLastBuildDate = chanElement.getChild("lastBuildDate");
        Element chanDocs = chanElement.getChild("docs");
        Element chanLanguage = chanElement.getChild("language");

        Element chanItem = chanElement.getChild("item");
        Element itemTitle = chanItem.getChild("title");
        Element itemDescription = chanItem.getChild("description");
        Element itemLink = chanItem.getChild("link");

        chanElement.setStartElementListener(new StartElementListener() {
            public void start(Attributes attributes) {
                channel = new Channel();
            }
        });

        // Listen for the end of a text element and set the text as our
        // channel's title.
        chanTitle.setEndTextElementListener(new EndTextElementListener() {
            public void end(String body) {
                channel.setTitle(body);
            }
        });

        // The same pattern applies to the other elements of channel.

        // On every <item> tag occurrence we create a new Item object.
        chanItem.setStartElementListener(new StartElementListener() {
            public void start(Attributes attributes) {
                item = new Item();
            }
        });

        // On every </item> tag occurrence we add the current Item object
        // to the Items container.
        chanItem.setEndElementListener(new EndElementListener() {
            public void end() {
                items.add(item);
            }
        });

        itemTitle.setEndTextElementListener(new EndTextElementListener() {
            public void end(String body) {
                item.setTitle(body);
            }
        });

        // and so on

        // Here we actually parse the InputStream, attach the collected
        // items, and return the resulting Channel object.
        try {
            Xml.parse(is, Xml.Encoding.UTF_8, root.getContentHandler());
            if (channel != null) {
                channel.setItems(items);
            }
            return channel;
        } catch (SAXException e) {
            // handle the exception
        } catch (IOException e) {
            // handle the exception
        }

        return null;
    }

}

As you can see, that was a very quick example. The major advantage of the android.sax implementation is that you can define the structure of the XML you have to parse and then just add an event listener to the appropriate elements. The disadvantage is that the code gets quite repetitive and bloated.

org.xml.sax Implementation

The org.xml.sax handler implementation is a bit different.

Here you don't specify or declare your XML structure; you just listen for events. The most widely used events are the following:

  • Document Start
  • Document End
  • Element Start
  • Element End
  • Characters between Element Start and Element End
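To make the order of these events concrete, here is a minimal sketch that records each callback as it fires. EventTrace and its trace helper are hypothetical names used only for illustration, and the sketch runs against the JDK's built-in parser rather than on Android. Note that a parser is allowed to report one run of text in several characters calls.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records every SAX callback in the order it fires.
class EventTrace extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() { events.add("startDocument"); }

    @Override
    public void endDocument() { events.add("endDocument"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        events.add("chars:" + new String(ch, start, length));
    }

    // Parses the given XML text and returns the recorded event names.
    static List<String> trace(String xml) {
        EventTrace handler = new EventTrace();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        trace("<item><title>TitleOne</title></item>").forEach(System.out::println);
    }
}
```

Running main prints one line per event, starting with startDocument and ending with endDocument.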

An example handler implementation using the Channel object above looks like this.

Example

public class ExampleHandler extends DefaultHandler {

    private Channel channel;
    private Items items;
    private Item item;
    private boolean inItem = false;

    private StringBuilder content;

    public ExampleHandler() {
        items = new Items();
        content = new StringBuilder();
    }

    public void startElement(String uri, String localName, String qName, 
            Attributes atts) throws SAXException {
        content = new StringBuilder();
        if(localName.equalsIgnoreCase("channel")) {
            channel = new Channel();
        } else if(localName.equalsIgnoreCase("item")) {
            inItem = true;
            item = new Item();
        }
    }

    public void endElement(String uri, String localName, String qName) 
            throws SAXException {
        if(localName.equalsIgnoreCase("title")) {
            if(inItem) {
                item.setTitle(content.toString());
            } else {
                channel.setTitle(content.toString());
            }
        } else if(localName.equalsIgnoreCase("link")) {
            if(inItem) {
                item.setLink(content.toString());
            } else {
                channel.setLink(content.toString());
            }
        } else if(localName.equalsIgnoreCase("description")) {
            if(inItem) {
                item.setDescription(content.toString());
            } else {
                channel.setDescription(content.toString());
            }
        } else if(localName.equalsIgnoreCase("lastBuildDate")) {
            channel.setLastBuildDate(content.toString());
        } else if(localName.equalsIgnoreCase("docs")) {
            channel.setDocs(content.toString());
        } else if(localName.equalsIgnoreCase("language")) {
            channel.setLanguage(content.toString());
        } else if(localName.equalsIgnoreCase("item")) {
            inItem = false;
            items.add(item);
        } else if(localName.equalsIgnoreCase("channel")) {
            channel.setItems(items);
        }
    }

    public void characters(char[] ch, int start, int length) 
            throws SAXException {
        content.append(ch, start, length);
    }

    public void endDocument() throws SAXException {
        // you can do something here for example send
        // the Channel object somewhere or whatever.
    }

}

To be honest, I can't really tell you any real advantage of this handler implementation over the android.sax one. I can, however, tell you the disadvantage, which should be pretty obvious by now. Take a look at the else-if chains in the startElement and endElement methods. Because the tags <title>, <link> and <description> occur both in the channel and in each item, we have to track where in the XML structure we currently are. That is, when we encounter an <item> start tag we set the inItem flag to true so that we map the data to the correct object, and in the endElement method we set the flag back to false when we encounter the </item> tag, to signal that we are done with that item.

In this example that is pretty easy to manage, but parsing a more complex structure with repeating tags at different levels becomes tricky. There you would either have to use enums, for example, to hold your current state together with a lot of switch/case statements to check where you are, or, more elegantly, some kind of tag tracker built on a tag stack.
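The tag-stack idea could look roughly like the sketch below. It is an illustration under assumptions, not part of the original answer: PathTrackingHandler, its fields and the parse helper are hypothetical names, and it uses qName with the JDK's default (non-namespace-aware) parser. Every open element is pushed onto a deque, so the current position can be read as a path such as /rss/channel/item/title instead of being tracked with boolean flags.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: tracks the open elements on a stack instead of flags.
class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> openElements = new ArrayDeque<>();
    private final StringBuilder content = new StringBuilder();
    String channelTitle;
    final List<String> itemTitles = new ArrayList<>();

    // Builds a path like "/rss/channel/item/title" from root to current element.
    private String path() {
        StringBuilder sb = new StringBuilder();
        for (String name : openElements) {
            sb.append('/').append(name);
        }
        return sb.toString();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        openElements.addLast(qName);
        content.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        content.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String current = path();
        if (current.equals("/rss/channel/title")) {
            channelTitle = content.toString();
        } else if (current.equals("/rss/channel/item/title")) {
            itemTitles.add(content.toString());
        }
        openElements.removeLast();
    }

    // Parses the given XML text and returns the populated handler.
    static PathTrackingHandler parse(String xml) {
        PathTrackingHandler handler = new PathTrackingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler;
    }
}
```

The same <title> tag is now told apart purely by its path, so adding another level of nesting only means adding another path comparison.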

Solution 2

Many tasks call for XML files of different kinds for different purposes. I won't try to cover everything; I'll just describe, from my own experience, what I needed it for.

Java is perhaps my favorite programming language, and that affection is reinforced by the fact that you can solve almost any problem with it without reinventing the wheel.

I needed to build a client-server pair backed by a database, so that the client could remotely add records to the server's database. Input validation and the like were needed too, of course, but that is not the topic here.

As the working principle I chose, without hesitation, to transmit the information as an XML file of the following form:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<doc>
    <id>3</id>
    <fam>Ivanov</fam>
    <name>Ivan</name>
    <otc>I.</otc>
    <dateb>10-03-2005</dateb>
    <datep>10-03-2005</datep>
    <datev>10-03-2005</datev>
    <datebegin>09-06-2009</datebegin>
    <dateend>10-03-2005</dateend>
    <vdolid>1</vdolid>
    <specid>1</specid>
    <klavid>1</klavid>
    <stav>2.0</stav>
    <progid>1</progid>
</doc>

To make the rest easier to follow, I'll just say that this is information about the doctors of an institution: last name, first name, unique id, and so on. In general, a series of data fields. The file arrives safely on the server side, and then parsing begins.

Of the two parsing options (SAX vs. DOM) I chose SAX, because it runs faster and because it was the first one I came across :)

So, as you know, to work with the parser we need to override the DefaultHandler methods we are interested in. To begin, import the required packages.

import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.*;

Now we can start writing our parser:

public class SAXPars extends DefaultHandler {
   ... 
} 

Let's start with the startDocument() method. As the name implies, it reacts to the start-of-document event. Here you can hook in various actions such as allocating memory or resetting values, but our example is pretty simple, so we just mark the start of work with an appropriate message:

@Override
public void startDocument() throws SAXException {
    System.out.println("Start parse XML ...");
}

Next, as the parser walks through the document it encounters an element, and the startElement() method is invoked. Its full form is startElement(String namespaceURI, String localName, String qName, Attributes atts). Here namespaceURI is the namespace, localName is the local name of the element, qName is the local name combined with the namespace prefix (separated by a colon), and atts holds the attributes of the element. In our case everything is simple: it is enough to take qName and store it in a helper string, thisElement. That way we record which element we are currently in.

@Override
public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException {
    thisElement = qName;
}
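To see how the three name arguments relate, here is a small sketch that records them for each element. NameDemo and its names helper are hypothetical, and the urn:example namespace is made up for the illustration. The factory is switched to namespace-aware mode here; a factory obtained from SAXParserFactory.newInstance() is not namespace-aware by default, and in that mode localName may be reported as an empty string, which is one reason to rely on qName as above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records "namespaceURI|localName|qName" per element.
class NameDemo extends DefaultHandler {
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String namespaceURI, String localName, String qName, Attributes atts) {
        seen.add(namespaceURI + "|" + localName + "|" + qName);
    }

    // Parses the XML with namespace processing enabled and returns the records.
    static List<String> names(String xml) {
        NameDemo handler = new NameDemo();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // assumption: we want URI and local names filled in
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.seen;
    }

    public static void main(String[] args) {
        // The prefixed root element carries a namespace; the child <id> does not.
        names("<a:doc xmlns:a=\"urn:example\"><id>3</id></a:doc>").forEach(System.out::println);
    }
}
```

For the prefixed root element the three values differ (namespace, doc, a:doc), while for the plain <id> element the namespace is empty and localName equals qName.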

Next, having met an element, we get to its value. This is where the characters() method comes in. It has the form characters(char[] ch, int start, int length). Here ch is the array that contains the text inside the element, and start and length are helper numbers giving the starting position in the array and the length of the text.

@Override
public void characters(char[] ch, int start, int length) throws SAXException {
    // Text reported by the parser for the element we are currently in
    String value = new String(ch, start, length);
    if (thisElement.equals("id")) {
        doc.setId(Integer.valueOf(value));
    }
    if (thisElement.equals("fam")) {
        doc.setFam(value);
    }
    if (thisElement.equals("name")) {
        doc.setName(value);
    }
    if (thisElement.equals("otc")) {
        doc.setOtc(value);
    }
    if (thisElement.equals("dateb")) {
        doc.setDateb(value);
    }
    if (thisElement.equals("datep")) {
        doc.setDatep(value);
    }
    if (thisElement.equals("datev")) {
        doc.setDatev(value);
    }
    if (thisElement.equals("datebegin")) {
        doc.setDatebegin(value);
    }
    if (thisElement.equals("dateend")) {
        doc.setDateend(value);
    }
    if (thisElement.equals("vdolid")) {
        doc.setVdolid(Integer.valueOf(value));
    }
    if (thisElement.equals("specid")) {
        doc.setSpecid(Integer.valueOf(value));
    }
    if (thisElement.equals("klavid")) {
        doc.setKlavid(Integer.valueOf(value));
    }
    if (thisElement.equals("stav")) {
        doc.setStav(Float.valueOf(value));
    }
    if (thisElement.equals("progid")) {
        doc.setProgid(Integer.valueOf(value));
    }
}

Ah, yes, I almost forgot: the object that collects the parsed data is of type Doctors. That class is defined elsewhere and has all the necessary setters and getters.

Next, obviously, the element ends and another one follows. The endElement() method is responsible for the end of an element. It signals that the element is over, and you can do whatever you need at that point. We do just that and clear thisElement.

@Override
public void endElement(String namespaceURI, String localName, String qName) throws SAXException {
    thisElement = "";
}
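One caveat about handling values directly in characters(): the SAX contract allows the parser to deliver the text of a single element in several characters() calls (this commonly happens around entity references such as &amp;), so converting each chunk on its own can lose data or fail. A more defensive variant, sketched here under assumptions, appends every chunk to a buffer and only uses the value in endElement(). BufferingHandler, its values map and the parse helper are hypothetical names; in the parser above you would call the Doctors setters at the point where the map is filled.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates text chunks and reads them at element end.
class BufferingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String namespaceURI, String localName, String qName, Attributes atts) {
        text.setLength(0); // forget text that belonged to the parent element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String namespaceURI, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            values.put(qName, value); // e.g. doc.setFam(value) in the parser above
        }
        text.setLength(0);
    }

    // Parses the given XML text and returns element name -> trimmed text.
    static Map<String, String> parse(String xml) {
        BufferingHandler handler = new BufferingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.values;
    }
}
```

Trimming in endElement() also makes numeric conversion tolerant of padding such as <id> 3 </id>, which would otherwise make Integer.valueOf throw a NumberFormatException.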

Having gone through the whole document this way, we reach the end of the file and endDocument() is called. In it we can free memory, print some diagnostics, and so on. In our case we just report that parsing has finished.

@Override
public void endDocument() {
    System.out.println("Stop parse XML ...");
}

So we now have a class that parses XML in our format. Here is the full text:

import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.*;

public class SAXPars extends DefaultHandler {

    // Object that collects the parsed data
    Doctors doc = new Doctors();
    // Name of the element we are currently inside ("" when between elements)
    String thisElement = "";

    public Doctors getResult() {
        return doc;
    }

    @Override
    public void startDocument() throws SAXException {
        System.out.println("Start parse XML ...");
    }

    @Override
    public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException {
        thisElement = qName;
    }

    @Override
    public void endElement(String namespaceURI, String localName, String qName) throws SAXException {
        thisElement = "";
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Text reported by the parser for the element we are currently in
        String value = new String(ch, start, length);
        if (thisElement.equals("id")) {
            doc.setId(Integer.valueOf(value));
        }
        if (thisElement.equals("fam")) {
            doc.setFam(value);
        }
        if (thisElement.equals("name")) {
            doc.setName(value);
        }
        if (thisElement.equals("otc")) {
            doc.setOtc(value);
        }
        if (thisElement.equals("dateb")) {
            doc.setDateb(value);
        }
        if (thisElement.equals("datep")) {
            doc.setDatep(value);
        }
        if (thisElement.equals("datev")) {
            doc.setDatev(value);
        }
        if (thisElement.equals("datebegin")) {
            doc.setDatebegin(value);
        }
        if (thisElement.equals("dateend")) {
            doc.setDateend(value);
        }
        if (thisElement.equals("vdolid")) {
            doc.setVdolid(Integer.valueOf(value));
        }
        if (thisElement.equals("specid")) {
            doc.setSpecid(Integer.valueOf(value));
        }
        if (thisElement.equals("klavid")) {
            doc.setKlavid(Integer.valueOf(value));
        }
        if (thisElement.equals("stav")) {
            doc.setStav(Float.valueOf(value));
        }
        if (thisElement.equals("progid")) {
            doc.setProgid(Integer.valueOf(value));
        }
    }

    @Override
    public void endDocument() {
        System.out.println("Stop parse XML ...");
    }
}

I hope this post helped convey the essence of the SAX parser in a simple way.

Please don't judge too harshly, this is my first article :) I hope it was useful to at least someone.

Update: to run this parser, you can use this code:

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
SAXPars saxp = new SAXPars();

parser.parse(new File("..."), saxp);

Author: Johan

Updated on July 09, 2022

Comments

  • Johan
    Johan almost 2 years

    I'm following this tutorial.

    It works great but I would like it to return an array with all the strings instead of a single string with the last element.

    Any ideas how to do this?
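    One possible approach, sketched here without knowing the tutorial in question: instead of overwriting a single String field each time the element ends, add every value to a list and return the list as an array once parsing is done. CollectingHandler, its collect helper and the element name passed in are hypothetical; the sketch uses the JDK's built-in parser and does not handle the target element nested inside itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps every occurrence of one element's text, not just the last.
class CollectingHandler extends DefaultHandler {
    private final String target;
    private final List<String> results = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inTarget;

    CollectingHandler(String target) {
        this.target = target;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(target)) {
            inTarget = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTarget) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals(target)) {
            results.add(text.toString()); // append instead of overwriting one field
            inTarget = false;
        }
    }

    // Returns the text of every <target> element in document order.
    static String[] collect(String xml, String target) {
        CollectingHandler handler = new CollectingHandler(target);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.results.toArray(new String[0]);
    }
}
```

    Calling CollectingHandler.collect(xml, "title") on the RSS feed would then give one array entry per <title> element.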