How to parse malformed HTML in python, using standard libraries

Solution 1

Parsing HTML reliably is a relatively modern development (weird though that may seem). As a result there is definitely nothing in the standard library. HTMLParser may appear to be a way to handle HTML, but it's not -- it fails on lots of very common HTML, and though you can work around those failures there will always be another case you haven't thought of (if you actually succeed at handling every failure you'll have basically recreated BeautifulSoup).
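
As a rough illustration of what "working around those failures" involves, here is a minimal sketch that feeds parser events into an ElementTree TreeBuilder. It assumes Python 3, where the module is html.parser and is more forgiving than the Python 2 HTMLParser described above. The class name LenientTreeParser, the wrapper element "fragment", and the recovery rules (ignore stray end tags, auto-close whatever is still open) are my own choices for illustration, not anything the standard library or the HTML5 spec prescribes.

```python
from html.parser import HTMLParser
from xml.etree.ElementTree import TreeBuilder

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LenientTreeParser(HTMLParser):
    """Hypothetical helper: build an ElementTree from HTMLParser events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive as plain text
        self.builder = TreeBuilder()
        self.open_tags = []  # tags started but not yet ended
        self.builder.start("fragment", {})  # wrapper so partial documents work

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) come through as None.
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: ignore it
        # Close everything opened after the matching start tag, then the tag.
        while True:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self):
        self.close()  # flush any text the parser is still buffering
        while self.open_tags:  # auto-close whatever was left open
            self.builder.end(self.open_tags.pop())
        self.builder.end("fragment")
        return self.builder.close()


def parse_fragment(markup):
    parser = LenientTreeParser()
    parser.feed(markup)
    return parser.finish()


root = parse_fragment("Hello, <i>World</i>!&nbsp;<big>does anyone <b>here")
print([el.tag for el in root.iter()])
print("".join(root.itertext()))
```

Even this small sketch needs a void-element list and two recovery rules, and it still does nothing about misnested tags, implied elements such as tbody, or encoding detection. Each of those is another case to handle by hand, which is the point made above.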

There are really only 3 reasonable ways to parse HTML (as it is found on the web): lxml.html, BeautifulSoup, and html5lib. lxml is the fastest by far, but can be a bit tricky to install (and impossible in an environment like App Engine). html5lib is based on how HTML 5 specifies parsing; though similar in practice to the other two, it is perhaps more "correct" in how it parses broken HTML (they all parse pretty-good HTML the same). They all do a respectable job at parsing broken HTML. BeautifulSoup can be convenient though I find its API unnecessarily quirky.

Solution 2

Take the source code of BeautifulSoup and copy it into your script ;-) I'm only sort of kidding... anything you could write that would do the job would more or less be duplicating the functionality that already exists in libraries like that.

If that's really not going to work, I have to ask, why is it so important that you only use standard library components?

Solution 3

Your choices are to change your requirements or to duplicate all of the work done by the developers of third party modules.

BeautifulSoup consists of a single Python file with about 2000 lines of code. If that is too big a dependency, then go ahead and write your own; it won't work as well and probably won't be a whole lot smaller.

Solution 4

This doesn't fit your standard-library-only requirement, but BeautifulSoup is nice.

Solution 5

I cannot think of any popular language with a good, robust, heuristic HTML parsing library in its stdlib. Python certainly does not have one, which is something I think you know.

Why the requirement of a stdlib module? Most of the time when I hear people make that requirement, they are being silly. For most major tasks, you will need a third party module or to spend a whole lot of work re-implementing one. Introducing a dependency is a good thing, since that's work you didn't have to do.

So what you want is lxml.html. Ship lxml with your code if that's an issue, at which point it becomes functionally equivalent to writing it yourself except in difficulty, bugginess, and maintainability.

Author by

bukzor

Updated on June 14, 2022

Comments

  • bukzor
    bukzor about 2 years

    There are so many html and xml libraries built into python, that it's hard to believe there's no support for real-world HTML parsing.

    I've found plenty of great third-party libraries for this task, but this question is about the python standard library.

    Requirements:

    • Use only Python standard library components (any 2.x version)
    • DOM support
    • Handle HTML entities (&nbsp;)
    • Handle partial documents (like: Hello, <i>World</i>!)

    Bonus points:

    • XPATH support
    • Handle unclosed/malformed tags. (<big>does anyone here know <html ???)

    Here's my 90% solution, as requested. This works for the limited set of HTML I've tried, but as everyone can plainly see, this isn't exactly robust. Since I did this by staring at the docs for 15 minutes and one line of code, I thought I would be able to consult the stackoverflow community for a similar but better solution...

    from xml.etree.ElementTree import fromstring
    DOM = fromstring("<html>%s</html>" % html.replace('&nbsp;', '&#160;'))
    
  • bukzor
    bukzor about 14 years
    That's one of the libraries that I referenced with this: "I've found plenty of great third-party libraries for this task, but this question is about the python standard library."
  • bukzor
    bukzor about 14 years
    It's not so important. It's simply my question. As I said, there is a ton of HTML and XML support in the Python library. It seems like something there should support this. If not, that's an answer too, but I'm not convinced yet.
  • Mike Graham
    Mike Graham about 14 years
    Note that BeautifulSoup is no longer being maintained. I prefer lxml.html myself. Overall, this is a great answer.
  • David Z
    David Z about 14 years
    Where did you hear that? The BeautifulSoup website shows no evidence that it is no longer being maintained. In fact the most recent release was 11 days ago. (Of course, any other third-party HTML parser works just as well for the argument I was making in the answer)
  • bukzor
    bukzor about 14 years
    From my research, I was seeing that as the most common answer, but I don't know, and I'm still not convinced that there's no such capability in the stdlib. You'll have to admit that a script that uses no external library is much more likely to work correctly for novice users.
  • Nick T
    Nick T about 14 years
    If it's really that compact (never really bothered to look :P ) and he's hell-bent on having a script work without any other dependencies, copy-paste sounds like a great plan.
  • Mike Graham
    Mike Graham about 14 years
    Literal copy-and-paste is a ridiculous way to add a dependency.
  • Nick T
    Nick T about 14 years
    Maybe he was thinking BS 3.0 was only for Python 3.x? Their site indicates BS 3.0 is for Py 2.3-2.6, and BS 3.1 is for Py 3.x (though ironically the last BS 3.1 release is about a year old, versus a couple weeks for BS 3.0)
  • Mike Graham
    Mike Graham about 14 years
    @David, Richardson has said multiple times that he is trying his best to quit BS development, though it seems he does still do a little. See e.g. crummy.com/software/BeautifulSoup/3.1-problems.html
  • Mike Graham
    Mike Graham about 14 years
    @bukzor, Well get convinced, since it's the case. =p And I do not have to admit that at all. ;)
  • Ian Bicking
    Ian Bicking about 14 years
    Parsing HTML is something people have only actually understood widely for a few years now; it's taken shockingly long. So it can be said quite definitively that there is nothing in the standard library: BeautifulSoup, html5lib, and lxml.html make a complete list.
  • bukzor
    bukzor about 14 years
    @Ian Bicking: If you'd make that an answer, I'd check it. Am I getting downrated simply because my answer is no?
  • bukzor
    bukzor about 14 years
    @Mike Graham: Under that link I see this: "... you can use Element Soup to feed the HTML into Beautiful Soup once ElementTree has cleaned it up." Can anyone expand what he means by that? How do you clean up HTML with ElementTree?
  • Mike Graham
    Mike Graham about 14 years
    @bukzor, (It seems a bit odd to ask me about stuff found on a page I presented about why not to use a piece of software.) In any event, as I understand the element tree API, you would call ElementSoup.parse(some_file).write(some_new_place) to parse an HTML file then write the tree you got after reconciling everything less than kosher about it. effbot.org/zone/element-index.htm#documentation provides some information about ElementTree in its various incarnations (which include this and other HTML parsers). Feel free to open a question for a more complete answer.
  • bukzor
    bukzor about 14 years
    @Mike Graham: I just noticed that the quote said ElementSoup, not ElementTree. I was asking about it because it seemed to imply that I could use ElementTree independent of BeautifulSoup for HTML "cleaning".
  • Mike Graham
    Mike Graham about 14 years
    @bukzor, Cleaning HTML is the topic of another question. The snippet I provided should be the essence of doing it with an ElementTree HTML parser. I don't understand what you're referring to with "the only reference to html seems to be a side project that is unmaintained since 2007". If you're talking about the ElementTree docs I linked to, the material that doesn't apply to HTML directly is still relevant if you're interested in an ElementTree-based HTML parser, since the API is independent of the exact format being parsed or generated with ElementTree.
  • Mike Graham
    Mike Graham about 14 years
    @bukzor, ElementSoup is an implementation of ElementTree using BeautifulSoup for parsing. ElementTree is an API with many implementations for parsing XML and HTML.
  • bukzor
    bukzor about 14 years
    Great answer. Thanks! I don't have enough rep to uprate you. QQ I wish people weren't so touchy about hard questions. The good scientist seeks negative experiments as well..
  • bukzor
    bukzor about 14 years
    @Ian Bicking: finally got enough rep to bump you. Just to confirm, there's no known way to get ElementTree (as it exists in the stdlib) to parse real-world HTML?
  • bukzor
    bukzor about 14 years
    @Mike Graham: Thanks. I'm inferring that any HTML parsers implemented with ElementTree are not included in the stdlib. Do you know of a better-maintained etree-html parser than esoup?
  • Mike Graham
    Mike Graham about 14 years
    @bukzor, There are no general-purpose, robust HTML parsers of any kind in the stdlib. lxml.html, which I have mentioned in several places, provides an extended ElementTree API. html5lib, which others have mentioned, is compatible with a number of APIs, including multiple ElementTree implementations, as best I understand it.
  • Ian Bicking
    Ian Bicking about 14 years
    You can have BeautifulSoup (with ElementSoup) or html5lib parse the HTML and generate an ElementTree structure, but ElementTree itself definitely cannot parse HTML.
  • bukzor
    bukzor about 14 years
    With some finagling and a little bit of HTML-correction, I've gotten ElementTree to parse all of RosettaCode.org. The most annoying part is adding all the html entities to the parser by hand. There's even an option for this in the etree docs, but it's unimplemented for undocumented reasons. You can see the code here: bukzor.hopto.org/svn/software/python/rosetta_pylint.py
  • gsnedders
    gsnedders almost 11 years
    html5lib has no extensions (e.g., C code) that it depends upon. It can optionally use several (such as datrie) to improve performance, but it will work fine without.
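
The question's one-liner can be stretched a little further without leaving the standard library. The sketch below assumes Python 3 and swaps the single replace('&nbsp;', '&#160;') for a regex pass over every name in html.entities.name2codepoint. Note that this rewrites the markup before parsing, which is a different route from registering entities on the parser as described in the RosettaCode comment. The helper names numeric_entities and parse_strict are hypothetical.

```python
import re
from html.entities import name2codepoint
from xml.etree.ElementTree import ParseError, fromstring

# Matches named references such as &nbsp; (numeric ones like &#160; are skipped).
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def numeric_entities(markup):
    """Rewrite known named entities (&nbsp;) as numeric ones (&#160;)."""
    def replace(match):
        codepoint = name2codepoint.get(match.group(1))
        # Unknown names are left alone; the XML parser will reject them.
        return match.group(0) if codepoint is None else "&#%d;" % codepoint
    return _NAMED_ENTITY.sub(replace, markup)


def parse_strict(markup):
    # Wrap in a single root so fragments like "Hello, <i>World</i>!" parse.
    return fromstring("<html>%s</html>" % numeric_entities(markup))


root = parse_strict("Hello,&nbsp;<i>World</i>! caf&eacute; &amp; tea")
print("".join(root.itertext()))

try:
    parse_strict("<big>does anyone here know <html ???")
    recovered = True
except ParseError:
    recovered = False  # strict XML parsing cannot recover from broken tags
print(recovered)
```

Entities and partial documents are covered this way, but the malformed-tag example from the question still raises ParseError, which is consistent with the comments above: ElementTree's own parser is an XML parser and will not repair real-world HTML.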