Java SAX Parsing


Solution 1

There is one neat trick when writing a SAX parser: you are allowed to change the ContentHandler of an XMLReader while parsing. This lets you separate the parsing logic for different elements into multiple classes, which makes the parser more modular and reusable. When a handler sees its end element, it switches back to its parent. How many handlers you implement is up to you. The code would look like this:

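import java.util.LinkedList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class RootHandler extends DefaultHandler {
    private XMLReader reader;
    private List<Team> teams;

    public RootHandler(XMLReader reader) {
        this.reader = reader;
        this.teams = new LinkedList<Team>();
    }

    // Called by TeamHandler when it reaches the end of a team element
    public void addTeam(Team team) {
        teams.add(team);
    }

    public void startElement(String uri, String localName, String name, Attributes attributes) throws SAXException {
        if (name.equals("team")) {
            // Switch handler to parse the team element
            reader.setContentHandler(new TeamHandler(reader, this));
        }
    }
}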

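import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TeamHandler extends DefaultHandler {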
    private XMLReader reader;
    private RootHandler parent;
    private Team team;
    private StringBuilder content;

    public TeamHandler(XMLReader reader, RootHandler parent) {
        this.reader = reader;
        this.parent = parent;
        this.content = new StringBuilder();
        this.team = new Team();
    }

    // characters can be called multiple times per element so aggregate the content in a StringBuilder
    public void characters(char[] ch, int start, int length) throws SAXException {
        content.append(ch, start, length);
    }

    public void startElement(String uri, String localName, String name, Attributes attributes) throws SAXException {
        content.setLength(0);
    }

    public void endElement(String uri, String localName, String name) throws SAXException {
        if (name.equals("name")) {
            team.setName(content.toString());
        } else if (name.equals("team")) {
            parent.addTeam(team);
            // Switch handler back to our parent
            reader.setContentHandler(parent);
        }
    }
}
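
To show how the handlers above might be wired to a parser, here is a minimal, self-contained sketch. It is an illustration rather than the answer's original code: the class name HandlerSwitchDemo and the method parseTeamNames are made up for this example, and a team is reduced to its name (a plain String) so that no separate Team class is needed. It assumes the JDK's default, non-namespace-aware SAX parser, which is why it compares against the qualified name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; all names here are assumptions for illustration.
class HandlerSwitchDemo {

    // Top-level handler: hands every <team> element over to a TeamHandler.
    static class RootHandler extends DefaultHandler {
        private final XMLReader reader;
        final List<String> teamNames = new ArrayList<>();

        RootHandler(XMLReader reader) {
            this.reader = reader;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("team")) {
                // Switch handler to parse the team element
                reader.setContentHandler(new TeamHandler(reader, this));
            }
        }
    }

    // Child handler: only ever sees events that occur inside one <team>.
    static class TeamHandler extends DefaultHandler {
        private final XMLReader reader;
        private final RootHandler parent;
        private final StringBuilder content = new StringBuilder();
        private String name;

        TeamHandler(XMLReader reader, RootHandler parent) {
            this.reader = reader;
            this.parent = parent;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so accumulate
            content.append(ch, start, length);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            content.setLength(0);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("name")) {
                name = content.toString();
            } else if (qName.equals("team")) {
                parent.teamNames.add(name);
                // Switch handler back to our parent
                reader.setContentHandler(parent);
            }
        }
    }

    // Parses the XML string and returns the team names in document order.
    static List<String> parseTeamNames(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        RootHandler root = new RootHandler(reader);
        reader.setContentHandler(root);
        reader.parse(new InputSource(new StringReader(xml)));
        return root.teamNames;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<players><player><name>P</name><teams>"
                + "<team><name>Reds</name></team><team><name>Blues</name></team>"
                + "</teams></player></players>";
        System.out.println(parseTeamNames(xml)); // prints [Reds, Blues]
    }
}
```

Note that the player's own name element never reaches TeamHandler, so the clash between a player's name and a team's name does not arise in this sketch.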

Solution 2

It's difficult to advise without knowing more about your requirements, but the fact that you are surprised that "my code got quite complex" suggests that you were not well informed when you chose SAX. SAX is a low-level programming interface capable of very high performance, but that's because the parser is doing far less work for you, and you therefore need to do a lot more work yourself.

Solution 3

I strongly recommend that you stop parsing by hand and grab a good XML data-binding library. XStream (http://x-stream.github.io/) is my personal favorite, but there are many different libraries. It may even be able to map the XML to your POJOs on the spot, without any configuration (if you use property names and pluralisation that match the XML structure).

Solution 4

I do something very similar, but instead of having boolean flags to tell me what state I'm in, I test for player or team being non-null. Makes things a bit neater. This requires you to set them to null when you detect the end of each element, after you've added it to the relevant list.
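
The null-check idea might look roughly like the sketch below. It is one possible reading of the approach, not the answerer's actual code: the class NullStateDemo, its parse method, and the Player and Team classes with plain fields are all assumptions made to keep the example self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; Player and Team are simplified stand-ins.
class NullStateDemo {

    static class Team {
        String id;
        String name;
    }

    static class Player {
        String id;
        String name;
        final List<Team> teams = new ArrayList<>();
    }

    static class PlayerHandler extends DefaultHandler {
        final List<Player> players = new ArrayList<>();
        private final StringBuilder content = new StringBuilder();
        private Player player; // non-null only while inside <player>
        private Team team;     // non-null only while inside <team>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            content.setLength(0);
            if (qName.equals("player")) {
                player = new Player();
            } else if (qName.equals("team")) {
                team = new Team();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            content.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String text = content.toString().trim();
            switch (qName) {
                case "id":
                    // The innermost open object wins: team first, then player
                    if (team != null) team.id = text;
                    else if (player != null) player.id = text;
                    break;
                case "name":
                    if (team != null) team.name = text;
                    else if (player != null) player.name = text;
                    break;
                case "team":
                    player.teams.add(team);
                    team = null; // we have left the <team> element
                    break;
                case "player":
                    players.add(player);
                    player = null; // we have left the <player> element
                    break;
                default:
                    break;
            }
        }
    }

    // Parses the XML string and returns the players it describes.
    static List<Player> parse(String xml) throws Exception {
        PlayerHandler handler = new PlayerHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.players;
    }
}
```

The object references double as the state: whichever of team or player is non-null tells endElement which object an id or name belongs to, so no separate boolean flags are needed.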

Author by

Haji

Updated on June 24, 2022

Comments

  • Haji
    Haji about 2 years

    There's an XML stream which I need to parse. Since I only need to read it once and build my Java objects, SAX looks like the natural choice. I'm extending DefaultHandler and implementing the startElement, endElement and characters methods, with members in my class where I save the value currently being read (taken in the characters method).

    I have no problem doing what I need, but my code got quite complex and I'm sure there's no reason for that and that I can do things differently. The structure of my XML is something like this:

    <players>
      <player>
        <id></id>
        <name></name>
        <teams total="2">
          <team>
            <id></id>
            <name></name>
            <start-date>
              <year>2009</year>
              <month>9</month>
            </start-date>
            <is-current>true</is-current>
          </team>
          <team>
            <id></id>
            <name></name>
            <start-date>
              <year>2007</year>
              <month>11</month>
            </start-date>
            <end-date>
              <year>2009</year>
              <month>7</month>
            </end-date>
          </team>
        </teams>
      </player>
    </players>
    

    My problem started when I realized that the same tag names are used in several areas of the file. For example, id and name exist for both a player and a team. I want to create instances of my Java classes Player and Team. While parsing, I kept boolean flags telling me whether I'm in the teams section, so that in endElement I know that the name is a team's name, not a player's name, and so on.

    Here's what my code looks like:

    public class MyParser extends DefaultHandler {
    
        private String currentValue;
        private boolean inTeamsSection = false;
        private Player player;
        private Team team;
        private List<Team> teams;
    
        public void characters(char[] ch, int start, int length) throws SAXException {
            currentValue = new String(ch, start, length);
        }
    
        public void startElement(String uri, String localName, String name, Attributes attributes) throws SAXException {
            if(name.equals("player")){
                player = new Player();
            }
            if (name.equals("teams")) {
                inTeamsSection = true;
                teams = new ArrayList<Team>();
            }
            if (name.equals("team")){
                team = new Team();
            }
        }   
    
        public void endElement(String uri, String localName, String name) throws SAXException {
            if (name.equals("id")) {
                if(inTeamsSection){
                    team.setId(currentValue);
                }
                else{
                    player.setId(currentValue);
                }
            }
            if (name.equals("name")){
                if(inTeamsSection){
                    team.setName(currentValue);
                }
                else{
                    player.setName(currentValue);
                }
            }
            if (name.equals("team")){
                teams.add(team);
            }
            if (name.equals("teams")){
                player.setTeams(teams);
                inTeamsSection = false;
            }
        }
    }
    

    Since in my real scenario a player has more child nodes in addition to the teams, and those nodes also have tags like name and id, I ended up with several boolean flags similar to inTeamsSection, and my endElement method became long and complex, with many conditions.

    What should I do differently? How can I know what a name tag, for instance, belongs to?

    Thanks!

  • Oleg Mikheev
    Oleg Mikheev over 12 years
    If there are Subteams, Players, etc., wouldn't all of them have to hold references to each other, resulting in VERY tight coupling?
  • Jörn Horstmann
    Jörn Horstmann over 12 years
    Each handler would have to know about its parent handler and the possible child handlers, so there definitely is some coupling. But, for example, the handler for start-date won't need to know about the handler for player.
  • Haji
    Haji over 12 years
    Thanks, I'm now using this trick and it works great for me. Just what I needed for this use case.