XML syntax validation in Java

Solution 1

You can check if an XML document is well-formed using the following code:

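import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);    // no DTD validation; well-formedness is still checked
factory.setNamespaceAware(true);

DocumentBuilder builder = factory.newDocumentBuilder();

builder.setErrorHandler(new SimpleErrorHandler());
// the "parse" method checks well-formedness and throws a SAXException if the XML is malformed
Document document = builder.parse(new InputSource("document.xml"));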

The SimpleErrorHandler class referred to in the above code is as follows:

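import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class SimpleErrorHandler implements ErrorHandler {
    @Override
    public void warning(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
    }

    // Fatal errors include well-formedness violations; rethrow so the caller cannot miss them
    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
        throw e;
    }
}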

This came from this website, which provides various methods for validating XML with Java. Note also that this method loads an entire DOM tree into memory; see the comments for alternatives if you want to save on RAM.
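
The SAX alternative mentioned in the comments can be sketched roughly as follows. This is a minimal sketch, not the code from the linked site: the class name SaxWellFormedCheck and the method isWellFormed are hypothetical, and the XML is passed in as a string purely to keep the example self-contained. It relies on the standard DefaultHandler, whose fatalError method rethrows the exception, so no DOM tree is built and a malformed document surfaces as a SAXException.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; the class and method names are illustrative only.
class SaxWellFormedCheck {

    /** Returns true if the given text parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false);   // no DTD validation, syntax only
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            // DefaultHandler ignores content events and rethrows fatal errors
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // the parser reported a well-formedness problem
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<root><child/></root>")); // true
        System.out.println(isWellFormed("<root><child></root>"));  // false
    }
}
```

For a file on disk, the same call should work with new InputSource("document.xml") as in the DOM version above.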

Solution 2

What you are asking is how to verify that a piece of content is a well-formed XML document. This is easily done by letting an XML parser (try to) parse the content in question: if there are issues, the parser reports an error by throwing an exception. There really isn't anything more to it, so all you need to figure out is how to parse an XML document.

About the only thing to beware of is that some libraries that claim to be XML parsers are not really proper parsers, in that they might not perform all the checks an XML parser must perform (as per the XML specification). In Java, Javolution is an example of something that does little to no checking; VTD-XML and XPP3 do some verification (but not all required checks). At the other end of the spectrum, Xerces and Woodstox check everything that the specification mandates. Xerces is bundled with the JDK, and most web service frameworks bundle Woodstox in addition.

Since the accepted answer already shows how to parse content into a DOM document (which starts with parsing), that might be enough. The only caveat is that this requires 3-5x as much memory as the raw size of the input document. To get around this limitation you could use a streaming parser, such as Woodstox (which implements the standard StAX API). If so, you would create an XMLStreamReader and just call reader.next() as long as reader.hasNext() returns true.
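
The streaming loop described above might look something like this. It is only a sketch under a few assumptions: the class name StaxWellFormedCheck and the method isWellFormed are made up for illustration, the input is a string for self-containment, and XMLInputFactory.newInstance() returns whichever StAX implementation is available (the JDK's built-in one, or Woodstox if it is on the classpath).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper; the class and method names are illustrative only.
class StaxWellFormedCheck {

    /** Returns true if the whole input can be streamed without a parse error. */
    static boolean isWellFormed(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                // Pull events until the end of the document; the events are discarded
                while (reader.hasNext()) {
                    reader.next();
                }
            } finally {
                reader.close();
            }
            return true;
        } catch (XMLStreamException e) {
            // the parser hit a well-formedness problem
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<root><child/></root>")); // true
        System.out.println(isWellFormed("<root><child></root>"));  // false
    }
}
```

Because each event is dropped as soon as it is read, memory use stays roughly constant regardless of document size.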

Author: Hristo

Updated on October 10, 2020

Comments

  • Hristo
    Hristo over 3 years

    I've been trying to figure out how to check the syntax of an XML file: make sure all tags are closed, there are no random characters, etc. All I care about at this point is making sure there is no broken XML in the file.

    I've been looking at some SO posts like these...

    ... but I realized that I don't want to validate the structure of the XML file; I don't want to validate against an XML Schema (XSD)... I just want to check the XML syntax and determine if it is correct.

  • Hristo
    Hristo almost 13 years
    I don't want to use XSD... I'm taking care of that kind of validation elsewhere. I just want to check syntax at the moment.
  • Hristo
    Hristo almost 13 years
    So will this check the syntax of the XML file? I don't want to use an XML Schema here...
  • James Allardice
    James Allardice almost 13 years
    Yes, it will check that the document follows the rules of "well-formedness" set out by the XML spec - w3.org/TR/xml/#sec-well-formed. This means that all elements must be closed, nested properly, etc. In fact, the spec defines well-formedness because you can't always use a DTD.
  • DaVinci
    DaVinci almost 13 years
    This parses it and therefore checks the syntax, because otherwise it couldn't parse it... What are you doing with this document anyway that you need to do this separately?
  • nsfyn55
    nsfyn55 almost 13 years
    Do you mind telling me what the issue with using XSD is? Do you not want to write XSD? How do you know what version of xml your document is to be compliant with?
  • DaVinci
    DaVinci almost 13 years
    But wouldn't SAX be a better choice performance-wise? He's not using the document anyway and therefore doesn't need to hold it in memory.
  • James Allardice
    James Allardice almost 13 years
    Yes, probably. That is, if he actually doesn't need the document in memory - I don't think he's implied that really. In that case, there is sample code to do exactly the same thing using SAX here: edankert.com/validate.html
  • Hristo
    Hristo almost 13 years
    No issue... there is code in place already to validate against an XSD. But it doesn't check syntax.
  • Hristo
    Hristo almost 13 years
    Currently, there is a system in place to check an XML file against an XSD (I think it's custom) but it doesn't check syntax. So I need to first try and get working code to check syntax, and then maybe improve upon the old system.
  • James Allardice
    James Allardice almost 13 years
    The method I gave you does check syntax. It does not check the document against an XSD.
  • Hristo
    Hristo almost 13 years
    With regard to having the document in memory, if I'm understanding correctly, I do not need the document in memory currently since I'm just trying to make sure it's valid XML syntax, not do anything with it.
  • Hristo
    Hristo almost 13 years
    @James Allardice... great! So it checks syntax, but how do I report the results of the parse? Do I just do factory.setValidating(true)?
  • James Allardice
    James Allardice almost 13 years
    You will get an error if the document is not well-formed. For example, if the document does not end with the same tag with which it starts, you get the following error: XML document structures must start and end within the same entity.
  • Hristo
    Hristo almost 13 years
    awesome! thanks for the explanation. now going back to the topic of performance... what were you and DaVinci going back and forth about?
  • James Allardice
    James Allardice almost 13 years
    No problem. The method I gave you in my answer uses DOM to parse the document, which builds up a tree of the document as it goes, using up potentially a lot of memory. SAX does not build up a tree of your document. You can find a good comparison of the two here: developerlife.com/tutorials/?p=28
  • Hristo
    Hristo almost 13 years
    If I use SAX, would it validate syntax as well? At the moment, I won't be needing the actual DOM tree. Also, using your example code above, the validation is incorrect... I have one line in my test file <antcall target="test_target"/> asdf, which is incorrect XML because of the random characters asdf but the validation passes.
  • nsfyn55
    nsfyn55 almost 13 years
    If you are validating your XML against an XSD and it's not well-formed, doesn't your validation catch that?
  • Hristo
    Hristo almost 13 years
    I don't think so... I didn't write it :) It might, but it most likely doesn't handle specific syntax issues that may come up.
  • James Allardice
    James Allardice almost 13 years
    Yes, the SAX method will validate syntax too. You are right though, floating text after an element does not cause an error. Try as I might, I cannot find any information about whether or not this is actually valid XML. As it gets through the parser with no problems, I can only assume that it is. Can anyone shed any light on that?
  • James Allardice
    James Allardice almost 13 years
    From what I can work out, characters floating outside of tags have no effect on anything. They are simply ignored by the parser. Because of this, you should never have any problems with characters like this in your XML document. If the floating characters contain a character that interferes with the document, such as a <, an error is thrown.
  • Hristo
    Hristo almost 13 years
    @James... thanks for looking into it. I guess I could post another question regarding this specifically. I'll hold off on that for a little while though and try to figure out why it's happening.
  • James Allardice
    James Allardice almost 13 years
    @Hristo - yeah, I'm trying to figure it out too, your problem has intrigued me now! I'm pretty sure that those characters will just be ignored though.
  • Hristo
    Hristo almost 13 years
    @James... :) I'm glad you're on board haha. If these characters get ignored by the parser, that doesn't necessarily mean it's valid XML, right?
  • James Allardice
    James Allardice almost 13 years
    I think the reason that they are being ignored is that they are valid - random characters outside of tags mean nothing, so they are ignored. I could be completely wrong though, it's not something I've ever needed to think about before! Why is it important to you to fail validation in a situation like this?
  • Hristo
    Hristo almost 13 years
    @James... I deleted that comment. It was a mistake on my part, my bad :)
  • James Allardice
    James Allardice almost 13 years
    Ha, I just refreshed the page and thought "What!? Where did that comment go!?" I'll delete mine too :)
  • Hristo
    Hristo almost 13 years
    great. so I'll accept your answer because I think it works. thanks for your help!
  • James Allardice
    James Allardice almost 13 years
    No problem, glad I could help! Thanks for accepting. I'm still interested in that "floating character" (as I've now taken to calling them) problem though... If I discover anything, I'll post it here to let you know.
  • Michael Kay
    Michael Kay almost 13 years
    If you're running your data through XSD validation then it will certainly pick up XML well-formedness errors as well.
  • Ted Hopp
    Ted Hopp almost 13 years
    If the parsers don't complain about the trailing characters, I think that's a mistake. The XML validator at xmlvalidation.com, for instance, will complain about random characters after the root element. The XML spec says that a well-formed document can only have white space, an XML comment, or an XML processing instruction after the root element.
  • nsfyn55
    nsfyn55 almost 13 years
    @Everyone Why the downvote? Care to explain yourself?
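
On the "floating character" question discussed in the comments above, a small experiment can show how a parser treats text after an element. The sketch below uses the JDK's default DOM parser; the class name TrailingTextDemo, the method parses, and the wrapping <target> element are all made up for illustration. Text following the root element is rejected, in line with Ted Hopp's reading of the spec, while the same text inside an enclosing element is ordinary character data and parses fine. That could be one reason the test line appeared to pass, though the thread does not confirm it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; names and sample documents are illustrative only.
class TrailingTextDemo {

    /** Returns true if the DOM parser accepts the text as a complete document. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and stays silent otherwise
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Text after the root element: not well-formed
        System.out.println(parses("<antcall target=\"test_target\"/> asdf")); // false
        // Same text inside an enclosing element: plain character data
        System.out.println(parses(
                "<target><antcall target=\"test_target\"/> asdf</target>")); // true
    }
}
```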