Parse HTML via XPath

Solution 1

In Python, ElementTidy parses tag soup and produces an element tree, which you can then query with XPath:

>>> from elementtidy.TidyHTMLTreeBuilder import TidyHTMLTreeBuilder as TB
>>> tb = TB()
>>> tb.feed("<p>Hello world")
>>> e = tb.close()
>>> e.find(".//{http://www.w3.org/1999/xhtml}p")
<Element {http://www.w3.org/1999/xhtml}p at 264eb8>
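
The `{http://www.w3.org/1999/xhtml}p` form in the query above is ElementTree's namespace-qualified tag syntax: the tidied output is XHTML, so every element lives in the XHTML namespace. As a rough sketch of the same query style, the snippet below uses the standard library's xml.etree.ElementTree (which supports only a limited XPath subset) on an already well-formed XHTML string. That string and the prefix map `NS` are assumptions for illustration; this code does not tidy tag soup.

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML stands in for the output of a tidying parser (assumption).
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Hello world</p><p>Second</p></body></html>"
)

# Hypothetical prefix map so queries need not spell out the full namespace URI.
NS = {"x": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# Fully qualified form, as in the session above.
first = root.find(".//{http://www.w3.org/1999/xhtml}p")

# Equivalent query using the prefix map.
paragraphs = root.findall(".//x:p", NS)

print(first.text)
print([p.text for p in paragraphs])
```

Forgetting the namespace (for example, querying `.//p`) matches nothing here, which is a common source of confusion with XHTML output.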

Solution 2

I'm surprised there isn't a single mention of lxml. It's blazingly fast and will work in any environment that allows CPython libraries.

Here's how you can parse HTML via XPath using lxml.

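>>> from lxml import etree
>>> doc = '<foo><bar></bar></foo>'
>>> tree = etree.HTML(doc)  # the HTML parser wraps the fragment in <html><body>

>>> r = tree.xpath('//foo/bar')  # so match anywhere rather than from the root
>>> len(r)
1
>>> r[0].tag
'bar'

>>> r = tree.xpath('//bar')
>>> r[0].tag
'bar'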

Solution 3

The most stable results I've had have been using lxml.html's soupparser. You'll need to install python-lxml and python-beautifulsoup, then you can do the following:

from lxml.html.soupparser import fromstring
tree = fromstring('<mal form="ed"><html/>here!')
matches = tree.xpath("./mal[@form='ed']")  # quote 'ed' so it is a string, not a child element name

Solution 4

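BeautifulSoup is a good Python library for dealing with messy HTML in clean ways. Note that it does not evaluate XPath itself: you navigate the tree with its own search methods, or use it as the parser behind lxml as in Solution 3.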

Solution 5

It seems the question could be more precisely stated as "How to convert HTML to XML so that XPath expressions can be evaluated against it".

Here are two good tools:

  1. TagSoup is an open-source, SAX-compliant parser written in Java by John Cowan. Instead of parsing well-formed or valid XML, it parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
    Taggle is a commercial C++ port of TagSoup.

  2. SgmlReader is a tool developed by Microsoft's Chris Lovett.
    SgmlReader is an XmlReader API over any SGML document (including built-in support for HTML). A command-line utility is also provided, which outputs the well-formed XML result.
    The download (SgmlReader.zip) includes the standalone executable and the full source code.
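
To make the "convert HTML to a tree, then run XPath" idea concrete, here is a deliberately simplified Python sketch built only on the standard library. It is not how TagSoup or SgmlReader work internally; the class name `SoupToTree`, the synthetic `root` wrapper element, and the recovery rules (close any unclosed tags when an ancestor closes, ignore stray closing tags) are all assumptions chosen for illustration. Real tools implement far more thorough error recovery.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SoupToTree(HTMLParser):
    """Hypothetical helper: feed messy HTML, get an ElementTree root back."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []               # stack of currently open tag names
        self.builder.start("root", {})    # synthetic wrapper element

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) arrive as None.
        clean = {k: (v if v is not None else "") for k, v in attrs}
        self.builder.start(tag, clean)
        if tag in VOID:
            self.builder.end(tag)         # void elements close immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray closing tags that match nothing currently open.
        if tag not in self.open_tags:
            return
        # Close everything opened after the matching tag, then the tag itself.
        while self.open_tags:
            name = self.open_tags.pop()
            self.builder.end(name)
            if name == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def close(self):
        super().close()
        # Close whatever the document left open, then the wrapper.
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("root")
        return self.builder.close()


parser = SoupToTree()
parser.feed('<div class="post"><p>Hello <b>world</p>'
            '<img src="a.png"><p>Unclosed</div>')
tree = parser.close()

# ElementTree's limited XPath subset is enough for simple queries.
paragraphs = tree.findall(".//div[@class='post']/p")
print([p.text for p in paragraphs])
print(tree.find(".//img").get("src"))
```

Once the tree is well formed, it could also be serialized with `ET.tostring(tree)` and handed to any XML tool that offers full XPath support.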

Author by

Tristan Havelick

Updated on February 25, 2020

Comments

  • Tristan Havelick, about 4 years

    In .Net, I found this great library, HtmlAgilityPack that allows you to easily parse non-well-formed HTML using XPath. I've used this for a couple years in my .Net sites, but I've had to settle for more painful libraries for my Python, Ruby and other projects. Is anyone aware of similar libraries for other languages?

  • PJP, over 13 years
    I'd highly recommend Nokogiri these days. It's everything Hpricot was and more.
  • dzen, almost 13 years
    BeautifulSoup does not use xpath :)
  • Jagtesh Chadha, over 12 years
    You might want to consider lxml for Python now
  • Gareth Davidson, about 12 years
    Danger! Use the BeautifulSoup parser for lxml instead as elementtidy will choke on namespaces that aren't declared. I learned the hard way!