Performant parsing of HTML pages with Node.js and XPath


Solution 1

You can do so in several steps.

  1. Parse HTML with parse5. The downside is that the result is not a DOM, though it's fast enough and W3C-compliant.
  2. Serialize it to XHTML with xmlserializer, which accepts parse5's DOM-like structures as input.
  3. Parse that XHTML again with xmldom. Now you finally have a DOM.
  4. The xpath library builds upon xmldom, allowing you to run XPath queries. Be aware that XHTML has its own namespace, so queries like //a won't work.

Finally you get something like this.

const fs = require('mz/fs');
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;

(async () => {
    const html = await fs.readFile('./test.htm');
    // 1. parse the HTML leniently with parse5 (not a W3C DOM yet)
    const document = parse5.parse(html.toString());
    // 2. serialize the parse5 tree to well-formed XHTML
    const xhtml = xmlser.serializeToString(document);
    // 3. parse the XHTML with xmldom to get a real DOM
    const doc = new dom().parseFromString(xhtml);
    // 4. run XPath queries against that DOM, mapping the XHTML namespace to a prefix
    const select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    const nodes = select("//x:a/@href", doc);
    console.log(nodes);
})();

Note that you have to prepend every HTML element in a query with the x: prefix. For example, to match an a inside a div you would need:

//x:div/x:a
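
Here's a hedged sketch of how that prefixed query plugs into the select call from the snippet above, and how you might pull the actual href strings out of the attribute nodes it returns (select and doc are the variables defined above):

// reusing `select` and `doc` from the snippet above
const links = select("//x:div/x:a/@href", doc);
// xmldom hands back attribute nodes; their string content is in `.value`
console.log(links.map(attr => attr.value));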

Solution 2

Libxmljs is currently the fastest implementation (something like a benchmark) since it is only bindings to the LibXML C library, which supports XPath 1.0 queries:

var libxmljs = require("libxmljs");

// a minimal XML string for illustration
var xml = '<root><child><grandchild>value</grandchild></child></root>';
var xmlDoc = libxmljs.parseXml(xml);

// xpath queries
var gchild = xmlDoc.get('//grandchild');

However, you need to sanitize your HTML first and convert it to proper XML. For that you could either use the HTMLTidy command line utility (tidy -q -asxml input.html), or, if you want to keep it Node-only, something like xmlserializer should do the trick.
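
Here's a hedged sketch of that Node-only route, reusing parse5 and xmlserializer from Solution 1 to produce XHTML that libxmljs can parse; it assumes libxmljs's find accepts a prefix-to-URI namespace map and that attribute nodes expose value():

const fs = require('fs');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const libxmljs = require('libxmljs');

const html = fs.readFileSync('./test.htm', 'utf8');
// sanitize: parse the HTML leniently, then serialize it as well-formed XHTML
const xhtml = xmlser.serializeToString(parse5.parse(html));
const xmlDoc = libxmljs.parseXml(xhtml);

// as in Solution 1, the XHTML namespace has to be mapped to a prefix
const hrefs = xmlDoc.find('//x:a/@href', { x: 'http://www.w3.org/1999/xhtml' });
console.log(hrefs.map(attr => attr.value()));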

Solution 3

I think Osmosis is what you're looking for.

  • Uses native libxml C bindings
  • Supports CSS 3.0 and XPath 1.0 selector hybrids
  • Sizzle selectors, Slick selectors, and more
  • No large dependencies like jQuery, cheerio, or jsdom
  • HTML parser features

    • Fast parsing
    • Very fast searching
    • Small memory footprint
  • HTML DOM features

    • Load and search ajax content
    • DOM interaction and events
    • Execute embedded and remote scripts
    • Execute code in the DOM

Here's an example:

const osmosis = require('osmosis');

let count = 0;

osmosis.get(url) // url of the page you want to scrape
    .find('//div[@class]/ul[2]/li')
    .then(function () {
        count++;
    })
    .done(function () {
        console.log('matched ' + count + ' list items');
    });
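
For actual data extraction (rather than just counting matches), a hedged sketch based on Osmosis's set/data pattern might look like this; the example.com URL and the selectors are placeholders:

const osmosis = require('osmosis');

osmosis.get('http://example.com')
    .find('//div[@class]/ul[2]/li')
    // set() maps property names to selectors relative to each matched node
    .set({ label: 'a', link: 'a@href' })
    .data(function (item) {
        // called once per matched <li>
        console.log(item.label, item.link);
    })
    .error(console.error);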

Solution 4

I have just started using npm install htmlstrip-native which uses a native implementation to parse and extract the relevant HTML parts. It claims to be 50 times faster than the pure JS implementation (I have not verified that claim).

Depending on your needs you can use html-strip directly, or lift the code and bindings to make your own use of the C++ used internally in htmlstrip-native.

If you want to use XPath, then use the wrapper already available here: https://www.npmjs.org/package/xpath

Solution 5

With just one line, you can do it with xpath-html:

const xpath = require("xpath-html");

const node = xpath.fromPageSource(html).findElement("//*[text()='Made with love by']");
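
If you need more than a single node, the package also appears to expose a plural variant; this is a hedged sketch based on its README, and the findElements call and the contains() query are assumptions for illustration:

const xpath = require("xpath-html");

// assumed API: findElements returns all nodes matching the expression
const nodes = xpath
    .fromPageSource(html)
    .findElements("//span[contains(@class, 'headline')]");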

Comments

  • polkovnikov.ph
    polkovnikov.ph over 3 years

    I'm into some web scraping with Node.js. I'd like to use XPath as I can generate it semi-automatically with several sorts of GUI. The problem is that I cannot find a way to do this effectively.

    1. jsdom is extremely slow. It parses a 500 KiB file in a minute or so, with full CPU load and a heavy memory footprint.
    2. Popular libraries for HTML parsing (e.g. cheerio) neither support XPath, nor expose W3C-compliant DOM.
    3. Effective HTML parsing is, obviously, implemented in WebKit, so using phantom or casper would be an option, but those require being run in a special way, not just node <script>. I cannot rely on the risk implied by that change. For example, it's much more difficult to find out how to run node-inspector with phantom.
    4. Spooky is an option, but it's buggy enough, so that it didn't run at all on my machine.

    What's the right way to parse an HTML page with XPath then?

  • polkovnikov.ph
    polkovnikov.ph over 9 years
    0. Your link is broken. 1. That library is parsing entities, and that's quite obvious from its name. 2. XPath is not even mentioned in your answer.
  • Soren
    Soren over 9 years
    Fixed the broken link; added a link to the xpath implementation. Any reason you didn't find/use that yourself?
  • polkovnikov.ph
    polkovnikov.ph over 9 years
    That xpath library has to be run over some kind of DOM. The only solution that parses HTML is jsdom, which is slow as hell. It's the first item in the list up there. Did you read the question?
  • Soren
    Soren over 9 years
    If you were to read the npm xpath documentation you would see that he suggests using xmldom.
  • polkovnikov.ph
    polkovnikov.ph over 9 years
    And how is xmldom supposed to parse HTML?
  • qqilihq
    qqilihq over 8 years
    Thank you, works perfectly. Except that I needed to replace var document = parser.parse(html.toString()); with var document = parse5.parse(html.toString()); and get rid of the line var parser = new parse5.Parser(); (using parse5 version 2.0.2)
  • Fabiosoft
    Fabiosoft about 5 years
    You are loading everything in memory (the entire DOM)... is there a more memory efficient way to do this?
  • Franck Freiburger
    Franck Freiburger about 4 years
    I'm wondering if it is possible to create a custom parse5 treeAdapter that would avoid the serializeToString/parseFromString step? (see github.com/inikulin/parse5/blob/master/packages/parse5/docs/…)
  • polkovnikov.ph
    polkovnikov.ph almost 4 years
    It's good that you've made a library that encapsulates an answer by @pda. If a better approach shows up, it's possible to update only one library. On the other hand, it's a little bit shady that you don't mention this is your library, and that this library is basically another answer from this thread.
  • polkovnikov.ph
    polkovnikov.ph almost 4 years
    @Fabiosoft Unfortunately XPath queries do require a DOM. There were implementations of a subset of XPath that worked over a SAX parser for PHP, but (I almost hope that) there is no such thing on npm.
  • polkovnikov.ph
    polkovnikov.ph almost 4 years
    @FranckFreiburger If I were to do any web crawlers today, I'd just use CSS selectors. They lack features like walking back to some parent, but you won't need anything beyond a call to parse5. XML and tooling around it (like XPath or Java) fell out of mainstream some time back in 2014.
  • Ciro Santilli OurBigBook.com
    Ciro Santilli OurBigBook.com over 3 years
    I wonder if there is any way to not have to add x: before every single element.
  • Ciro Santilli OurBigBook.com
    Ciro Santilli OurBigBook.com over 3 years
    Worth noting that this has one/some serious bugs right now: github.com/hieuvp/xpath-html/issues/10#issuecomment-752248148