SAX parser: Ignoring special characters

java xml parsing sax saxparser


Solution 1

I think your solution is not too bad: it takes only a few lines of code to do exactly what you want. The problem is that the startEntity and endEntity methods are not provided by the ContentHandler interface, so you have to write a LexicalHandler that works in combination with your ContentHandler. An XMLFilter is usually more elegant, but since you have to work with entities, you still need to write a LexicalHandler. Take a look here for an introduction to the use of SAX filters.

I'd like to show you a way, very similar to yours, which lets you separate filtering operations (turning & back into &amp;, for instance) from output operations (or anything else). I've written my own XMLFilter based on XMLFilterImpl which also implements the LexicalHandler interface. This filter contains only the code related to escaping and unescaping entities.

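import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Filter that turns each reported entity back into its "&name;" form
public class XMLFilterEntityImpl extends XMLFilterImpl implements
        LexicalHandler {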

    private String currentEntity = null;

    public XMLFilterEntityImpl(XMLReader reader)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        super(reader);
        setProperty("http://xml.org/sax/properties/lexical-handler", this);
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
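        if (currentEntity == null) {
            // Ordinary text: pass it through unchanged
            super.characters(ch, start, length);
            return;
        }

        // Text produced by an entity: forward the original reference instead
        String entity = "&" + currentEntity + ";";
        super.characters(entity.toCharArray(), 0, entity.length());
        currentEntity = null;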
    }

    @Override
    public void startEntity(String name) throws SAXException {
        currentEntity = name;
    }

    @Override
    public void endEntity(String name) throws SAXException {
    }

    @Override
    public void startDTD(String name, String publicId, String systemId)
            throws SAXException {
    }

    @Override
    public void endDTD() throws SAXException {
    }

    @Override
    public void startCDATA() throws SAXException {
    }

    @Override
    public void endCDATA() throws SAXException {
    }

    @Override
    public void comment(char[] ch, int start, int length) throws SAXException {
    }
}

And this is my main, with a DefaultHandler as ContentHandler which receives the entity as it is according to the filter code:

public static void main(String[] args) throws ParserConfigurationException,
        SAXException, IOException {

    DefaultHandler defaultHandler = new DefaultHandler() {
        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            //This method receives the entity as is
            System.out.println(new String(ch, start, length));
        }
    };

    XMLFilter xmlFilter = new XMLFilterEntityImpl(XMLReaderFactory.createXMLReader());
    xmlFilter.setContentHandler(defaultHandler);
    String xml = "<html><head><title>title</title></head><body>&amp;</body></html>";
    xmlFilter.parse(new InputSource(new StringReader(xml)));
}

And this is my output:

title
&amp;

You may not like it, but it is an alternative solution.

I'm sorry, but I don't think SAX offers a more elegant way.

You should also consider switching to a StAX parser: it's very easy to do what you want with XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES set to false. If you like this solution, you should take a look here.
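A rough sketch of the StAX route, not taken from the linked page: the class name StaxEntityDemo and the sample document are made up for illustration, and the entity is declared in an internal DTD so that the parser knows about it. It assumes the StAX implementation bundled with the JDK. Predefined entities such as &amp; may still arrive already replaced, so treat this as a starting point rather than a guarantee.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: rebuilds the text content while keeping
// declared entity references in their "&name;" form.
class StaxEntityDemo {

    static String textWithEntityReferences(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to report entity references instead of expanding them
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);

        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        StringBuilder out = new StringBuilder();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.ENTITY_REFERENCE) {
                // getLocalName() gives the entity name for this event type
                out.append('&').append(reader.getLocalName()).append(';');
            } else if (event == XMLStreamConstants.CHARACTERS) {
                out.append(reader.getText());
            }
        }
        reader.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        // Sample input (an assumption): nbsp is declared in the internal DTD subset
        String xml = "<!DOCTYPE r [<!ENTITY nbsp \"&#160;\">]><r>a&nbsp;b</r>";
        System.out.println(textWithEntityReferences(xml));
    }
}
```

Compared with the SAX filter, there is no separate lexical handler to register: the entity reference is just another event in the pull loop.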

Solution 2

If you supply a LexicalHandler as a callback to the SAX parser, it will inform you of the start and end of every entity reference using startEntity() and endEntity() callbacks.

(Note that the JavaDoc at http://download.oracle.com/javase/1.5.0/docs/api/org/xml/sax/ext/LexicalHandler.html talks of "entities" when the correct term is "entity references").

Note also that there is no way to get a SAX parser to tell you about numeric character references such as &#x1234;. Applications are supposed to treat these in exactly the same way as the original character, so you really shouldn't be interested in them.
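To make the callbacks concrete, here is a minimal sketch that records what the LexicalHandler is told. The class name EntityEventLog and the sample document are assumptions for illustration; it uses DefaultHandler2 from org.xml.sax.ext, which implements both ContentHandler and LexicalHandler, and the parser bundled with the JDK. Whether predefined entities such as &amp; are reported may depend on the parser and its configuration, so the sample uses an entity declared in an internal DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class: logs the entity callbacks a SAX parser makes.
class EntityEventLog {

    static List<String> entityEvents(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startEntity(String name) {
                events.add("startEntity(" + name + ")");
            }

            @Override
            public void endEntity(String name) {
                events.add("endEntity(" + name + ")");
            }
        };

        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // The lexical handler is registered through a property, not a setter
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return events;
    }

    public static void main(String[] args) throws Exception {
        // &nbsp; is a declared entity reference; &#x41; is a numeric character reference
        String xml = "<!DOCTYPE r [<!ENTITY nbsp \"&#160;\">]><r>a&nbsp;b&#x41;c</r>";
        for (String event : entityEvents(xml)) {
            System.out.println(event);
        }
    }
}
```

With this input the log should mention nbsp but contain nothing for &#x41;, which matches the point above about numeric character references.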

Solution 3

There is one more way: the escapeXml method of the org.apache.commons.lang.StringEscapeUtils class.

Try this code in your characters(char[] ch, int start, int length) method:

String data = new String(ch, start, length);
String escapedData = org.apache.commons.lang.StringEscapeUtils.escapeXml(data);

You may download the jar here.
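If adding the jar is not an option, the same idea can be approximated without a dependency. The following is only a sketch: XmlEscapeSketch and its escapeXml method are hypothetical names, it handles just the five predefined XML entities, and it is not the Commons Lang implementation. Because it runs after parsing, it escapes every special character, not only the ones that were escaped in the source.

```java
// Hypothetical helper: re-escapes the five predefined XML entities.
class XmlEscapeSketch {

    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);        // everything else is copied as is
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Same usage as in characters(char[] ch, int start, int length)
        char[] ch = "a & b < c".toCharArray();
        String data = new String(ch, 0, ch.length);
        System.out.println(escapeXml(data));
    }
}
```

Note that a non-breaking space character passes through unchanged here, so this does not bring back &nbsp;.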

Solution 4

The temporary solution:

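// Assumed context (not shown in the original): a handler that extends
// DefaultHandler, implements LexicalHandler, and is registered through the
// "http://xml.org/sax/properties/lexical-handler" property.
private boolean inEntity = false;
private String entityName;

public void startEntity(String name) throws SAXException {
    inEntity = true;
    entityName = name;
}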

public void characters(char[] ch, int start, int length) throws SAXException {
    String data;
    if (inEntity) {
        inEntity = false;
        data = "&" + entityName + ";";
    } else {
        data = new String(ch, start, length);
    }
    //TODO do something instead of System.out
    System.out.println(data);
}

But I still need an elegant solution.
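For readers who want something they can compile, here is one way the fragment above could be fleshed out. This is a sketch under assumptions, not the original poster's class: EntityPreservingHandler is a made-up name, it extends DefaultHandler2 (which covers both ContentHandler and LexicalHandler), it collects text in a StringBuilder instead of printing, and it tracks entity depth so that the flag is cleared in endEntity rather than in characters.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler: keeps "&name;" in the collected text instead of
// the entity's replacement characters.
class EntityPreservingHandler extends DefaultHandler2 {

    private final StringBuilder text = new StringBuilder();
    private int entityDepth = 0; // > 0 while inside an entity's replacement text

    // "[dtd]" and "%name" are reported for the DTD and parameter entities
    private static boolean isGeneralEntity(String name) {
        return !name.startsWith("[") && !name.startsWith("%");
    }

    @Override
    public void startEntity(String name) {
        if (isGeneralEntity(name) && entityDepth++ == 0) {
            text.append('&').append(name).append(';');
        }
    }

    @Override
    public void endEntity(String name) {
        if (isGeneralEntity(name)) {
            entityDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Skip the replacement text; the reference was already written
        if (entityDepth == 0) {
            text.append(ch, start, length);
        }
    }

    static String parse(String xml) throws Exception {
        EntityPreservingHandler handler = new EntityPreservingHandler();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample input (an assumption): nbsp is declared in the internal DTD subset
        System.out.println(parse("<!DOCTYPE r [<!ENTITY nbsp \"&#160;\">]><r>a&nbsp;b</r>"));
    }
}
```

Resetting state in endEntity avoids a corner case of the flag-based version: if an entity expands to no text at all, the next ordinary characters() call would otherwise be replaced by the entity reference.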


Author by

Alexander Oleynikov

Updated on July 30, 2022

Comments

Alexander Oleynikov over 1 year

I'm using Xerces to parse my XML document. The issue is that XML escaped characters like &nbsp; appear in characters() method as non-escaped ones. I need to get escaped characters inside characters() method as is.

Thanks.

UPD: Tried to override resolveEntity() method in my DefaultHandler's descendant. Can see from debug that it's set as entity resolver to XML reader but code from overridden method is not invoked.
Alexander Oleynikov about 13 years

Thanks, but how can I intercept resolving entities, not just be aware that they were resolved?
javanna about 13 years

Escaping XML entities is a good idea, but it doesn't work correctly with &nbsp; as requested in the question. Maybe you can use StringEscapeUtils#escapeHtml, but you might have some side effects: it escapes every non-breaking space character, whether or not it was written as &nbsp; in the source, so you cannot preserve only the original &nbsp;. @Aleksander O: Do you have both the literal non-breaking space character and &nbsp; in your xml? Can you accept this side effect?
Raedwald about 13 years

+1 for "Applications are supposed to treat these in exactly the same way as the original character": I think the OP is trying to do something that XML tries to make impossible.
Mike Sokolov over 12 years

There are times when you may care about exact character offsets into an original serialized XML source, even if XML practitioners would like to think that never matters - sometimes XML is just a file. In such a case, you do need to care about the distinction between numeric entities and the characters they represent (and I think there may be a similar issue with the built-in XML entities &lt; &quot; and &amp;). Woodstox 4 (a StAX XML parser) can provide this information, optionally while parsing, but I don't believe Xerces can.
WhyNotHugo about 12 years

What interfaces do you implement? What superclasses does this have? The three or four ones that came to mind don't work with this, this example is really incomplete.
WhyNotHugo about 12 years

Actually, startEntity could just be "char[] c = { '&' }; characters(c, 0, 1);". This is slightly more efficient, since it doesn't involve creating a few temporary strings, and gets the same result.