Can we use XPath with BeautifulSoup?

python web-scraping xpath beautifulsoup urllib

242,076

Solution 1

Nope, BeautifulSoup, by itself, does not support XPath expressions.

An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it'll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.

Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.

try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen
from lxml import etree

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
# xpathselector is a placeholder: substitute your own XPath expression.
tree.xpath(xpathselector)
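
For a version that runs without network access, here is a minimal sketch of the same approach applied to an inline HTML string. The sample markup and the `rows_html` name are made up for illustration, and the XPath expression is just one possible value for the `xpathselector` placeholder:

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
rows_html = """
<html><body>
  <table>
    <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
    <tr><td class="empformbody">Bob</td></tr>
  </table>
</body></html>
"""

htmlparser = etree.HTMLParser()
# fromstring() parses a string and returns the root element.
tree = etree.fromstring(rows_html, htmlparser)

# Text of every <td> whose class attribute is exactly "empformbody".
cells = tree.xpath('//td[@class="empformbody"]/text()')
print(cells)
```

Note that `@class="empformbody"` compares the whole attribute value, so a cell with several classes would not match this particular expression.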

There is also a dedicated lxml.html module with additional functionality.

Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:

import lxml.html
import requests

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)

Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    pass

Coming full circle: BeautifulSoup itself does have very complete CSS selector support:

for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    pass
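
A minimal runnable sketch of that selector, assuming Beautiful Soup 4 and a made-up document that contains two tables so the effect of `table#foobar` is visible:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
sample = """
<table id="foobar">
  <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
  <tr><td class="empformbody">Bob</td></tr>
</table>
<table id="elsewhere">
  <tr><td class="empformbody">ignored</td></tr>
</table>
"""

soup = BeautifulSoup(sample, 'html.parser')

# Only cells inside the table with id="foobar" match the selector.
names = [cell.get_text() for cell in soup.select('table#foobar td.empformbody')]
print(names)
```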

Solution 2

I can confirm that there is no XPath support within Beautiful Soup.

Solution 3

As others have said, BeautifulSoup doesn't have XPath support. There are probably a number of ways to evaluate an XPath expression, including using Selenium. However, here's a solution that works in either Python 2 or 3:

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)

I used this as a reference.

Solution 4

BeautifulSoup has a function named findNext that searches onward from the current element, including its children, so:

father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')

The code above can roughly imitate the following XPath:

div[@class="class_value"]/div[@id="id_value"]
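
A small sketch of the chained lookup on made-up markup, assuming Beautiful Soup 4, where the methods are spelled `find_next` and `find_all` (`findNext` and `findAll` are the older names). The `father` element and the `sample` string are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
sample = """
<div id="father">
  <div class="class_value">
    <div id="id_value">
      <a href="/one">one</a>
      <a href="/two">two</a>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(sample, 'html.parser')
father = soup.find('div', {'id': 'father'})

# Each find_next() jumps to the first matching element later in the document.
links = (
    father.find_next('div', {'class': 'class_value'})
          .find_next('div', {'id': 'id_value'})
          .find_all('a')
)
hrefs = [a['href'] for a in links]
print(hrefs)
```

One caveat: `find_next` returns the first match that appears anywhere later in the document, so it is not restricted to direct children the way the XPath `/` step is. The two agree on this sample only because of how the markup is nested.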

Solution 5

from lxml import etree
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('path of your localfile.html'),'html.parser')
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()'))

The code above combines the Soup object with lxml: the soup is serialized back to a string and re-parsed by lxml, after which values can be extracted using XPath.
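
A self-contained variant of the same idea, parsing an inline string instead of a local file. The markup and the simplified XPath are assumptions made for this sketch, not the structure of the original page:

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
sample = '<div id="BGINP01_S1"><div><font>hello</font></div></div>'

# Parse with Beautiful Soup first, then hand the serialized markup to lxml.
soup = BeautifulSoup(sample, 'html.parser')
dom = etree.HTML(str(soup))

values = dom.xpath('//*[@id="BGINP01_S1"]/div/font/text()')
print(values)
```

The document is parsed twice this way, so when Beautiful Soup's own features are not needed, parsing with lxml directly avoids the extra step.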


Author by

Shiva Krishna Bavandla

I love to work on python and django using jquery and ajax.

Updated on November 20, 2021

Comments

Shiva Krishna Bavandla over 2 years
I am using BeautifulSoup to scrape a URL, and I have the following code to find the td tag whose class is 'empformbody':
```
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)

soup.findAll('td',attrs={'class':'empformbody'})
```
Now in the above code we can use findAll to get tags and information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If possible, please provide me example code.
Shiva Krishna Bavandla almost 12 years

Yes, actually until now I used Scrapy, which uses XPath to fetch the data inside tags. It's very handy and makes fetching data easy, but I now need to do the same with BeautifulSoup, so I'm looking into it.
Shiva Krishna Bavandla almost 12 years

Thanks very much, Pieters. I got two pieces of information from your code: 1. a clarification that we can't use XPath with BS, and 2. a nice example of how to use lxml. Is there any documentation that states in writing that "we can't implement XPath using BS"? We should be able to show some proof to anyone who asks for clarification, right?
Martijn Pieters almost 12 years

It's hard to prove a negative; the BeautifulSoup 4 documentation has a search function and there are no hits for 'xpath'.
senshin almost 10 years

Note: Leonard Richardson is the author of Beautiful Soup, as you'll see if you click through to his user profile.
DarthOpto over 9 years

It would be very nice to be able to use XPATH within BeautifulSoup
static_rtti about 7 years

So what is the alternative?
wordsforthewise over 5 years

One warning: I've noticed if there is something outside the root (like a \n outside the outer <html> tags), then referencing xpaths by the root will not work, you have to use relative xpaths. lxml.de/xpathxslt.html
robertspierre over 5 years

I believe this only finds the child elements. XPath is another thing?
Martijn Pieters over 4 years

"Martijn's code no longer works properly (it is 4+ years old by now...), the etree.parse() line prints to the console and doesn't assign the value to the tree variable." That's quite a claim. I certainly can't reproduce that, and it would not make any sense. Are you sure you are using Python 2 to test my code with, or have translated the urllib2 library use to Python 3 urllib.request?
wordsforthewise over 4 years

Yeah, that may be the case that I used Python3 when writing that and it didn't work as expected. Just tested and yours works with Python2, but Python3 is much preferred as 2 is being sunset (no longer officially supported) in 2020.
Martijn Pieters over 4 years

absolutely agree, but the question here uses Python 2.
AMC almost 4 years

select() is for CSS selectors, it's not XPath at all. "as you see, this does not support sub-tag" While I'm not sure if that was true at the time, it certainly isn't now.
Zvi over 3 years

I tried running your code above but got an error "name 'xpathselector' is not defined"
Martijn Pieters over 3 years

@Zvi the code doesn’t define an Xpath selector; I meant it to be read as “use your own XPath expression here”.
mshaffer about 3 years

@leonard-richardson It's 2021, are you still confirming that BeautifulSoup STILL does not have xpath support?