Reading HTML file to DOM tree using Java

Solution 1

Use JTidy: either convert the stream to XHTML and re-parse it with your favourite DOM implementation, or call parseDOM if the limited DOM implementation it gives you is enough.

Alternatively, use NekoHTML.

Solution 2

Since HTML files are generally problematic, you'll need to first clean them up using a parser/scanner. I've used JTidy but never happily. NekoHTML works okay, but all of these tools only make a best guess at what was intended. You're effectively asking a program to alter a document's markup until it conforms to a schema. That will likely cause structural (markup), style or content loss. It's unavoidable, and you won't really know what's missing unless you manually check the result in a browser (and then you have to trust the browser too).

It really depends on your purpose. If you have thousands of ugly documents with tons of extraneous (non-HTML) markup, then a manual process is probably unreasonable. If your goal is accuracy on a few important documents, then manually fixing them is a reasonable proposition.

One approach is the manual process of repeatedly passing the source through a parser that checks well-formedness and/or validates, in an edit cycle that uses the error messages to eventually fix the broken markup. This does require some understanding of XML, but that's not a bad education to undertake.
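As a rough sketch of one pass of that edit cycle, using only the JDK: the class and method names here (WellFormedCheck, firstError) are made up for illustration. It reports where the parser gave up, so you know which line of the source to fix next.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper: one pass of the "parse, read the error, fix, repeat" cycle.
final class WellFormedCheck {

    /** Returns null if the markup is well-formed, else "line:column: message". */
    static String firstError(String markup) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false); // check well-formedness only, no DTD validation
        DocumentBuilder db = dbf.newDocumentBuilder();

        // Replace the default handler (which prints to stderr) with one that
        // ignores warnings and rethrows errors so they can be caught below.
        db.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { }
            @Override public void error(SAXParseException e) throws SAXParseException { throw e; }
            @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });

        try {
            db.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // The <p> element is never closed, so this is not well-formed XML.
        System.out.println(firstError("<html><body><p>broken</body></html>"));
    }
}
```

A well-formedness error is fatal, so the parser stops at the first one; you fix it, run again, and get the next.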

Since Java 5 the necessary XML features (JAXP, the Java API for XML Processing) are built into Java itself; you don't need any external libraries.

You first obtain an instance of a DocumentBuilderFactory, set its features, create a DocumentBuilder (parser), then call its parse() method with an InputSource. InputSource has several constructors; the following example wraps a StringReader:

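import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
// ...

// Parses well-formed markup (e.g. cleaned-up XHTML) into a DOM Document.
static Document parse(String source) throws Exception {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setValidating(false);                        // no DTD validation
    dbf.setNamespaceAware(true);
    dbf.setIgnoringComments(false);                  // keep comments in the tree
    dbf.setIgnoringElementContentWhitespace(false);  // keep whitespace nodes
    dbf.setExpandEntityReferences(false);
    DocumentBuilder db = dbf.newDocumentBuilder();
    return db.parse(new InputSource(new StringReader(source)));
}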

This returns a DOM Document. If you don't mind using external libraries there are also the JDOM and XOM APIs, and while these have some advantages over the SAX and DOM APIs in JAXP, they do require third-party libraries to be added to your project. The DOM can be somewhat cumbersome, but after so many years of using it I don't really mind any longer.
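Since the question asks for the standard DOM/XPath API, here is a minimal sketch of querying such a Document with javax.xml.xpath. The names XPathOverDom and linkTargets are invented for the example, and it assumes the input is already well-formed and declares no default namespace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the href of every <a> element in a document.
final class XPathOverDom {

    static List<String> linkTargets(String xhtml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Assumption: the elements are in no namespace. If the document declares
        // xmlns="http://www.w3.org/1999/xhtml", a plain "//a" matches nothing and
        // you need a NamespaceContext (or local-name()) in the expression.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

        List<String> targets = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            targets.add(hrefs.item(i).getNodeValue());
        }
        return targets;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"a.html\">A</a>"
                + "<p><a href=\"b.html\">B</a></p></body></html>";
        System.out.println(linkTargets(page));
    }
}
```

The same XPath calls work on a Document produced by JTidy or NekoHTML, since those also hand back org.w3c.dom nodes; only the parsing step changes.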

Solution 3

Here is a link that might be useful. It's a list of open source HTML parsers in Java: Open Source HTML Parsers in Java

Author: Stefan Teitge

Updated on July 05, 2022

Comments

  • Stefan Teitge
    Stefan Teitge almost 2 years

    Is there a parser/library which is able to read an HTML document into a DOM tree using Java? I'd like to use the standard DOM/Xpath API that Java provides.

    Most libraries seem to have custom APIs to solve this task. Furthermore, the conversion from HTML to an XML DOM seems to be unsupported by most of the available parsers.

    Any ideas or experience with a good HTML DOM parser?

  • Stefan Teitge
    Stefan Teitge over 15 years
    Neko + Xerces do the job quite well. Thanks to everyone who answered.
  • Joel
    Joel over 14 years
    Beware of JTidy. It has a memory leak bug. If you run it in a production system it will eventually blow up: StackOverflowError and eventually OutOfMemoryError. That said, it is wonderfully good at fixing broken HTML so that you can feed it into a DOM parser.
  • prashant
    prashant over 12 years
    Is there a clean way to use JTidy as a front end to JDOM or XOM in a streaming fashion? That is, without reading the whole document into memory first? (And without using PipedInput/OutputStream and multiple threads?) Or would I be better off just using Neko in that case?
  • Martin Spamer
    Martin Spamer almost 12 years
    Xerces is very strict about validation, which makes it unsuitable for reading real-world HTML pages.
  • Mark Bennett
    Mark Bennett over 11 years
    Everybody suggests JTidy or its variants, but another reason to BEWARE is that JTidy isn't that predictable. You will always get warnings from it, and it's hard to tell from that torrent whether the page was really processable or not (my experience was some years back). As I recall it was also fussy about ampersands, which I would have thought easier to recover from than other HTML glitches.
  • spaaarky21
    spaaarky21 over 10 years
    I've used JAXP extensively with XML but I didn't find using JAXP very useful for HTML, even after disabling validation or taking the rest of the steps you've suggested. But perhaps the HTML that I was trying to parse was just too far from being valid XHTML.