BeautifulSoup: what's the difference between 'lxml' and 'html.parser' and 'html5lib' parsers?
Solution 1
From the documentation's summary table of advantages and disadvantages:
- html.parser: BeautifulSoup(markup, "html.parser")
  Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2)
  Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)
- lxml: BeautifulSoup(markup, "lxml")
  Advantages: Very fast, Lenient
  Disadvantages: External C dependency
- html5lib: BeautifulSoup(markup, "html5lib")
  Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5
  Disadvantages: Very slow, External Python dependency
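The practical effect of "lenient" is easiest to see on invalid markup. The following is a minimal sketch, not part of the original answer: it feeds the same broken snippet to each parser and prints the tree each one builds. The helper name parse_with_each and the sample string are illustrative assumptions; lxml and html5lib are optional packages, so the sketch skips any that are not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid markup: an <a> that is never closed, and a </p> with no matching <p>.
BROKEN = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: serialized tree} for every parser that is installed.

    Hypothetical helper for illustration only.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # lxml and html5lib are third-party packages and may be missing.
            continue
    return results


results = parse_with_each(BROKEN)
for name, tree in results.items():
    print(f"{name:12} {tree}")
```

According to the Beautiful Soup documentation, the three parsers repair this snippet differently: html.parser drops the stray closing tag and adds no wrapper, lxml drops it but wraps the result in html and body tags, and html5lib pairs it with an opening p tag and also adds a head. That is why code that works with one parser can break when another is swapped in.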
Solution 2
The key differences are highlighted in the BeautifulSoup documentation. The basic reasons to prefer one parser over the others are:
- html.parser: built in, no extra dependencies needed
- html5lib: the most lenient; prefer it if the HTML is broken
- lxml: the fastest
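One way to turn that reasoning into code is a small helper that tries parsers in order of preference and falls back to the built-in one. This is only a sketch under stated assumptions: the function name make_soup and the broken_html flag are hypothetical, and the preference order is just the rule of thumb above (lxml for speed, html5lib for broken HTML, html.parser as the dependency-free fallback).

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, broken_html=False):
    """Parse markup with the first available parser in a preference order.

    Hypothetical helper: prefers html5lib when the HTML is known to be
    broken, otherwise lxml, and always falls back to html.parser.
    Returns (soup, parser_name) so the caller knows which parser was used.
    """
    if broken_html:
        preferred = ("html5lib", "lxml", "html.parser")
    else:
        preferred = ("lxml", "html5lib", "html.parser")

    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>hi</p>")
print(used, soup.p.get_text())
```

Because the parsers can build different trees from the same input, silently falling back is a trade-off: it keeps the script running everywhere, but results may differ between machines. Naming one parser explicitly is the safer choice when the output must be reproducible.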
duc hathaway
Updated on July 17, 2022
Comments
-
duc hathaway almost 2 years
When using Beautiful Soup, what is the difference between 'lxml', 'html.parser', and 'html5lib'?
When would you use one over the others, and what are the benefits of each? When I used each they seemed to be interchangeable, but people here have corrected me, saying I should be using a different one. I'd like to strengthen my understanding; I've read a couple of posts on here about this, but they don't go over the use cases much, if at all.
Example:
soup = BeautifulSoup(response.text, 'lxml')
-
kd88 almost 6 years
Thanks, html5lib (as a parser of broken HTML) just saved my bacon