Reading XML document nodes containing special characters (&, -, etc) with Java

java xml parsing special-characters

14,789

Solution 1

The & is an escape character in XML. XML that looks like this:

<theaterName>P&G Greenbelt</theaterName>

should actually be rejected by the parser. Instead, it should look like this:

<theaterName>P&amp;G Greenbelt</theaterName>

There are a few such characters, such as < (&lt;), > (&gt;), " (&quot;) and ' (&apos;). There are also other ways to escape characters, such as via their Unicode value, as in &#8226; or &#x3039;.
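As a rough illustration of the point above, the following sketch feeds both forms of the element to the standard JAXP DOM parser. The class and method names (AmpersandCheck, rootText) are made up for this example; only the javax.xml and org.w3c.dom calls are real APIs.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, for illustration only.
class AmpersandCheck {

    /** Returns the text of the root element, or null if the input is not well-formed. */
    static String rootText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            // A bare '&' is a well-formedness error, so the parser gives up here.
            return null;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unescaped ampersand: rejected by the parser.
        System.out.println(rootText("<theaterName>P&G Greenbelt</theaterName>"));
        // Escaped ampersand: parses, and the text comes back with a literal '&'.
        System.out.println(rootText("<theaterName>P&amp;G Greenbelt</theaterName>"));
    }
}
```

The parser unescapes &amp; for you, so application code should never see the entity itself, only the & character.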

For more information, the XML specification is fairly clear.

The other possibility, depending on how your tree was constructed, is that the character is escaped properly in the source, and the sample you showed isn't what's actually in the file but how the data is represented in the tree.

For example, when using SAX to build a tree, entities (the &-thingies) are broken apart and delivered separately. This is because the SAX parser tries to return contiguous chunks of data, and when it gets to the escape character, it sends what it has, and starts a new chunk with the translated &-value. So you might need to combine consecutive text nodes in your tree to get the whole value.
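If you are handling SAX events yourself, the usual way to cope with this is to accumulate every characters() call between the start and end of the element. The sketch below assumes the element is called theaterName, as in the question; the handler class and its parse helper are hypothetical names for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only.
class TheaterNameHandler extends DefaultHandler {
    private final List<String> names = new ArrayList<>();
    private StringBuilder current; // non-null only while inside <theaterName>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("theaterName".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may call this several times for one element,
        // e.g. once for "P", once for "&" and once for "G Greenbelt".
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("theaterName".equals(qName)) {
            names.add(current.toString());
            current = null;
        }
    }

    /** Parses the XML string and returns the text of every theaterName element. */
    static List<String> parse(String xml) {
        try {
            TheaterNameHandler handler = new TheaterNameHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

How many chunks the text arrives in depends on the parser, which is why the handler appends rather than assigns.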

Solution 2

The file you are trying to read is not valid XML. No self-respecting XML parser will accept it.

"I'm retrieving my XML dynamically from the web. What's the best way to replace all my escape characters after fetching the Document object?"

You are taking the wrong approach. The correct approach is to inform the people responsible for creating that file that it is invalid, and request that they fix it. Simply writing hacks to (try to) fix broken XML is not in your (or other people's) long-term interest.

If you decide to ignore this advice, then one approach is to read the file into a String, use String.replaceAll(regex, replacement) with a suitable regex to turn these bogus "&" characters into proper character entities ("&amp;"), then pass the "fixed" XML string to the XML parser. You need to design the regex carefully so that it doesn't break valid character entities as an unwanted side effect. A second approach is to do the parsing and replacement by hand, using appropriate heuristics to distinguish the bogus "&" characters from well-formed character entities.
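For what it's worth, here is one possible shape for such a regex. It is a heuristic sketch and not a complete solution: it escapes any & that is not followed by something that looks like a named, decimal or hexadecimal reference. The class and method names are invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class AmpersandFixer {
    // Matches '&' unless it starts something shaped like a reference:
    //   &name;   (e.g. &amp;)   &#8226;   &#x3039;
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    /** Replaces every bare '&' with "&amp;" and leaves reference-shaped text alone. */
    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("<theaterName>P&G Greenbelt</theaterName>"));
    }
}
```

Two limits to be aware of: text that merely looks like a reference (say "AT&T;") is left untouched even though no such entity is declared, and an & inside a CDATA section or comment, where it is legal, gets escaped anyway. This is the kind of fragility the next paragraph warns about.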

But this all costs you development and test time, and slows down your software. Worse, there is a significant risk that your code will be fragile as a result of your efforts to compensate for the bad input files. (And guess who will get the blame!)


Author by

Dan

Updated on June 05, 2022

Comments

Dan almost 2 years

My code does not retrieve the entirety of element nodes that contain special characters. For example, for this node:

<theaterName>P&G Greenbelt</theaterName>

It would only retrieve "P" due to the ampersand. I need to retrieve the entire string.

Here's my code:

public List<String> findTheaters() {

    //Clear theaters application global
    FilmhopperActivity.tData.clearTheaters();

    ArrayList<String> theaters = new ArrayList<String>();

    NodeList theaterNodes = doc.getElementsByTagName("theaterName");

    for (int i = 0; i < theaterNodes.getLength(); i++) {

        Node node = theaterNodes.item(i);
        if (node.getNodeType() == Node.ELEMENT_NODE) {

            //Found theater, add to return array
            Element element = (Element) node;
            NodeList children = element.getChildNodes();
            String name = children.item(0).getNodeValue();
            theaters.add(name);

            //Logging
            android.util.Log.i("MoviefoneFetcher", "Theater found: " + name);

            //Add theater to application global
            Theater t = new Theater(name);
            FilmhopperActivity.tData.addTheater(t);
        }
    }

    return theaters;
}

I tried adding code to extend the name string to concatenate additional children.items, but it didn't work. I'd only get "P&".

...
String name = children.item(0).getNodeValue();
for (int j = 1; j < children.getLength() - 1; j++) {
    name += children.item(j).getNodeValue();
}

Thanks for your time.

UPDATE: Found a method called normalize() that you can call on a Node. It merges adjacent text child nodes, so children.item(0) then contains the text of all the children, including ampersands!
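For anyone comparing the two approaches, here is a small sketch. It assumes the parser produced three adjacent text nodes ("P", "&", "G Greenbelt"), which would explain the "P&" result: the loop bound children.getLength() - 1 in the attempt above stops one child early. The class name and the sample() helper are made up for this example; the DOM calls themselves are standard.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only.
class TheaterText {

    /** Builds <theaterName> with three separate text children (an assumed parser result). */
    static Element sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element element = doc.createElement("theaterName");
            element.appendChild(doc.createTextNode("P"));
            element.appendChild(doc.createTextNode("&"));
            element.appendChild(doc.createTextNode("G Greenbelt"));
            return element;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Concatenates all children; the bound is getLength(), not getLength() - 1. */
    static String joinChildren(Element element) {
        StringBuilder sb = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int j = 0; j < children.getLength(); j++) {
            String value = children.item(j).getNodeValue();
            if (value != null) {
                sb.append(value);
            }
        }
        return sb.toString();
    }

    /** Merges adjacent text nodes in place, then reads the first child. */
    static String afterNormalize(Element element) {
        element.normalize();
        Node first = element.getFirstChild();
        return first == null ? "" : first.getNodeValue();
    }
}
```

Where the DOM implementation supports DOM Level 3, element.getTextContent() should return the same combined string without modifying the tree.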