Can we use XPath with BeautifulSoup?
Solution 1
Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it'll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.
Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.
try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    # Python 3
    from urllib.request import urlopen
from lxml import etree

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
# xpathselector is a placeholder: use your own XPath expression here
tree.xpath(xpathselector)
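As a concrete illustration of what xpathselector might hold, here is a minimal sketch that parses an inline HTML string instead of fetching a URL. The sample markup and the XPath expression are assumptions made up for this example; substitute your own.

```python
from io import StringIO

from lxml import etree

# Hypothetical sample markup standing in for the downloaded page.
sample_html = """
<html><body>
  <table>
    <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
    <tr><td class="empformbody">Bob</td></tr>
  </table>
</body></html>
"""

# Parse from a file-like object, as in the urlopen() example above.
tree = etree.parse(StringIO(sample_html), etree.HTMLParser())

# An XPath 1.0 expression; this one matches <td> elements whose class
# attribute is exactly "empformbody".
xpathselector = '//td[@class="empformbody"]'
cells = tree.xpath(xpathselector)

# Each result is an lxml element; .text holds its leading text content.
cell_texts = [cell.text for cell in cells]
print(cell_texts)
```

.xpath() returns a list, so an expression that matches nothing yields an empty list rather than raising an error.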
There is also a dedicated lxml.html module with additional functionality.
Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:
import lxml.html
import requests
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)
Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:
from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    pass
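For comparison, the class matching that a CSS selector such as td.empformbody performs can also be written by hand in XPath 1.0. The sketch below uses a hypothetical inline document and a hand-written expression (not necessarily the exact string CSSSelector generates) that treats class as a space-separated token list, so it also matches cells carrying more than one class.

```python
from lxml import html

# Hypothetical sample markup, made up for this illustration.
sample_html = """
<table>
  <tr><td class="empformbody">one</td></tr>
  <tr><td class="empformbody highlight">two</td></tr>
  <tr><td class="empformbodyextra">three</td></tr>
</table>
"""
doc = html.fromstring(sample_html)

# Exact attribute comparison: misses cells that have additional classes.
exact = doc.xpath('//td[@class="empformbody"]')

# Token match: pad the normalized class value with spaces and look for
# " empformbody " so that only whole class names match.
token = doc.xpath(
    '//td[contains(concat(" ", normalize-space(@class), " "), " empformbody ")]'
)

exact_texts = [td.text for td in exact]
token_texts = [td.text for td in token]
print(exact_texts, token_texts)
```

The token form is longer, which is one reason the CSS syntax is convenient for class-based lookups.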
Coming full circle: BeautifulSoup itself does have very complete CSS selector support:
for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    pass
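A self-contained sketch of the select() call, assuming the bs4 package is installed and using Python's built-in html.parser. The markup is hypothetical and only serves to show that the selector is scoped to the table with id foobar.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two tables.
sample_html = """
<table id="foobar">
  <tr><td class="empformbody">inside</td><td>plain</td></tr>
</table>
<table id="other">
  <tr><td class="empformbody">outside</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector, not an XPath expression.
cell_texts = [
    cell.get_text() for cell in soup.select("table#foobar td.empformbody")
]
print(cell_texts)
```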
Solution 2
I can confirm that there is no XPath support within Beautiful Soup.
Solution 3
As others have said, BeautifulSoup doesn't have XPath support. There are probably a number of ways to evaluate an XPath expression, including using Selenium. However, here's a solution that works in either Python 2 or 3:
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')
print('Buyers: ', buyers)
print('Prices: ', prices)
I used this as a reference.
Solution 4
BeautifulSoup has a method named findNext that searches forward from the current element, so calls can be chained:
father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')
The code above can imitate the following XPath:
div[@class="class_value"]/div[@id="id_value"]
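The chained calls in this solution are only an approximation of an XPath child step. In current Beautiful Soup releases findNext is spelled find_next, and it returns the next match in document order, which need not be a child of the starting element. The sketch below, with hypothetical markup and assuming bs4 is installed, shows the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the matching div is a sibling of "father",
# not one of its children.
sample_html = """
<div id="father"><p>no nested div here</p></div>
<div class="class_value"><div id="id_value"><a href="/x">x</a></div></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
father = soup.find(id="father")

# find_next walks forward through the document, so it still finds the div.
found = father.find_next("div", {"class": "class_value"})
is_descendant = any(parent is father for parent in found.parents)

# A child-only search, closer to the XPath step div[@class="class_value"]
# evaluated relative to father, finds nothing here.
children = father.find_all("div", {"class": "class_value"}, recursive=False)

# The rest of the chain then continues from the element that was found.
links = found.find_next("div", {"id": "id_value"}).find_all("a")
hrefs = [a["href"] for a in links]
print(is_descendant, len(children), hrefs)
```

If only direct children should match, find() or find_all() with recursive=False is the closer equivalent.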
Solution 5
from lxml import etree
from bs4 import BeautifulSoup

with open('path of your localfile.html') as f:
    soup = BeautifulSoup(f, 'html.parser')
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()'))
The code above combines a BeautifulSoup object with lxml, so you can extract values using XPath.
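The same round trip can be tried without a local file. This is a minimal sketch with a hypothetical inline document whose structure was chosen to fit the XPath expression from the solution; it assumes both bs4 and lxml are installed.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical markup shaped to match the XPath used in the solution.
sample_html = (
    '<div id="BGINP01_S1"><section><div><font>42</font></div></section></div>'
)

soup = BeautifulSoup(sample_html, "html.parser")

# Serialize the soup back to a string and re-parse it with lxml.
dom = etree.HTML(str(soup))
values = dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()')
print(values)
```

Note that this parses the document twice, so passing the original markup straight to lxml is usually cheaper when the BeautifulSoup object is not needed for anything else.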
Shiva Krishna Bavandla
Updated on November 20, 2021
Comments
-
Shiva Krishna Bavandla over 2 years
I am using BeautifulSoup to scrape a URL and I had the following code to find the td tag whose class is 'empformbody':
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td', attrs={'class': 'empformbody'})
Now in the above code we can use findAll to get tags and information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If possible, please provide me example code.
-
Shiva Krishna Bavandla almost 12 years
Yes, actually until now I used Scrapy, which uses XPath to fetch the data inside tags. It's very handy and easy to fetch data, but I need to do the same with BeautifulSoup, so I'm looking into it.
-
Shiva Krishna Bavandla almost 12 years
Thanks very much Pieters, I got two pieces of information from your code: 1. a clarification that we can't use XPath with BS; 2. a nice example of how to use lxml. Is there documentation that states in writing that XPath can't be used with BS? We should be able to show some proof to anyone who asks for clarification, right?
-
Martijn Pieters almost 12 years
It's hard to prove a negative; the BeautifulSoup 4 documentation has a search function and there are no hits for 'xpath'.
-
senshin almost 10 years
Note: Leonard Richardson is the author of Beautiful Soup, as you'll see if you click through to his user profile.
-
DarthOpto over 9 years
It would be very nice to be able to use XPath within BeautifulSoup.
-
static_rtti about 7 years
So what is the alternative?
-
wordsforthewise over 5 years
One warning: I've noticed that if there is something outside the root (like a \n outside the outer <html> tags), then referencing XPaths by the root will not work; you have to use relative XPaths. lxml.de/xpathxslt.html
-
robertspierre over 5 years
I believe this only finds the child elements. XPath is another thing?
-
Martijn Pieters over 4 years
"Martijn's code no longer works properly (it is 4+ years old by now...), the etree.parse() line prints to the console and doesn't assign the value to the tree variable." That's quite a claim. I certainly can't reproduce that, and it would not make any sense. Are you sure you are using Python 2 to test my code with, or have translated the urllib2 library use to Python 3 urllib.request?
-
wordsforthewise over 4 years
Yeah, it may be the case that I used Python 3 when writing that and it didn't work as expected. Just tested and yours works with Python 2, but Python 3 is much preferred as 2 is being sunset (no longer officially supported) in 2020.
-
Martijn Pieters over 4 years
Absolutely agree, but the question here uses Python 2.
-
AMC almost 4 years
select() is for CSS selectors, it's not XPath at all. "as you see, this does not support sub-tag" While I'm not sure if that was true at the time, it certainly isn't now.
-
Zvi over 3 years
I tried running your code above but got an error "name 'xpathselector' is not defined"
-
Martijn Pieters over 3 years
@Zvi the code doesn't define an XPath selector; I meant it to be read as "use your own XPath expression here".
-
mshaffer about 3 years
@leonard-richardson It's 2021, are you still confirming that BeautifulSoup STILL does not have XPath support?