lxml.html parsing with XPath and variables
Solution 1
Your first example works, but probably not how you think it should:
test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()='One'")
This returns a boolean, which is true if the condition ...='One' holds for any node in the node-set on the left-hand side of the XPath expression. That's why you get the error in your second example: True[0] is not valid.
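To make the boolean result concrete, here is a minimal sketch. It assumes a trimmed, closed-off copy of the markup from the question (the `SNIPPET` string is a stand-in, not the real page) and parses it with `lxml.html.fromstring` rather than fetching a URL:

```python
import lxml.html

# Stand-in markup modelled on the question's snippet (assumption, not the real page).
SNIPPET = """
<ul class="toc">
  <li class="level2"><div class="li"><a href="#link1">One</a></div></li>
  <li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
  <li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
</ul>
"""

doc = lxml.html.fromstring(SNIPPET)
path = "//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()"

# A comparison at the top level of the expression evaluates to a boolean.
found = doc.xpath(path + "='One'")
missing = doc.xpath(path + "='Four'")
print(found, missing)

# Without the comparison, the same path yields a list of text nodes.
texts = doc.xpath(path)
print(texts)
```

Since `found` is a plain Python `bool`, indexing it with `[0]` cannot work, whereas `texts` is an ordinary list.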
You probably want all nodes matching the expression that have 'One' as text. The corresponding expression would be:
test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']")
This returns a node-set as the result. If you just need the URL as a string:
test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']/@href")
# returns: ['#link1']
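If you want both the link text and the URL, one option is to select the elements and read the values from Python. This is only a sketch: `SNIPPET` is a trimmed stand-in for the markup in the question, and `pairs` is a name made up for the example.

```python
import lxml.html

# Stand-in markup modelled on the question's snippet (assumption, not the real page).
SNIPPET = """
<ul class="toc">
  <li class="level2"><div class="li"><a href="#link1">One</a></div></li>
  <li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
  <li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
</ul>
"""

doc = lxml.html.fromstring(SNIPPET)

# The predicate keeps only the <a> elements whose text is exactly 'One'.
anchors = doc.xpath(
    "//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']"
)

# Each result is an element, so text and attributes are available directly.
pairs = [(a.text, a.get("href")) for a in anchors]
for text, href in pairs:
    print(text, href)  # prints: One #link1
```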
Solution 2
I tried mata's answer, but it didn't work for me:
div_name = 'foo'
my_div = x.xpath(".//div[@id=%s]" %div_name)[0]
I found this on the lxml website, http://lxml.de/xpathxslt.html#the-xpath-method, for those who might have the same problem:
div_name = 'foo'
my_div = x.xpath(".//div[@id=$name]", name=div_name)[0]
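Applied to the markup from the question, the variable form could look roughly like this. The helper name `find_links`, the variable name `term`, and the `SNIPPET` string are all made up for the sketch; only the `$variable` plus keyword-argument mechanism comes from the lxml page linked above.

```python
import lxml.html

# Stand-in markup modelled on the question's snippet (assumption, not the real page).
SNIPPET = """
<ul class="toc">
  <li class="level2"><div class="li"><a href="#link1">One</a></div></li>
  <li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
  <li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
</ul>
"""

doc = lxml.html.fromstring(SNIPPET)


def find_links(root, searchterm):
    """Return (text, href) pairs for anchors whose text equals searchterm."""
    anchors = root.xpath(
        "//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()=$term]",
        term=searchterm,  # bound to $term as an XPath string value
    )
    return [(a.text, a.get("href")) for a in anchors]


print(find_links(doc, "One"))   # [('One', '#link1')]
print(find_links(doc, "one"))   # [] because the comparison is case-sensitive
print(find_links(doc, "It's"))  # [] and no quoting problem with the apostrophe

# For comparison: without quotes, the formatted expression becomes
# .//a[text()=One], where One is read as a child element name, not a string.
unquoted = doc.xpath(".//a[text()=%s]" % "One")
print(unquoted)                 # [] so indexing it with [0] raises IndexError
```

Passing the value as a variable also avoids having to escape quotes inside the search term, which string formatting would require.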
duenni
Updated on February 06, 2020
Comments
-
duenni over 4 years
I have this HTML snippet:
<div id="dw__toc">
  <h3 class="toggle">Table of Contents</h3>
  <div>
    <ul class="toc">
      <li class="level1"><div class="li"><a href="#section">#</a></div>
        <ul class="toc">
          <li class="level2"><div class="li"><a href="#link1">One</a></div></li>
          <li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
          <li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
Now I want to parse it with lxml.html. In the end I want a function that takes a search term (e.g. "one") and returns
One #link1
For now I'm trying to use a variable in the XPath.
Works:
import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")
test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()='One'")
print test
Trying with a variable. I want to replace the hardcoded 'One' with a variable that I can later pass to the function.
Doesn't work:
import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")
desiredvars = ['One']
myresultset=((var, html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='%s']"%(var))[0]) for var in desiredvars)
for each in myresultset:
    print each

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <genexpr>
IndexError: list index out of range
This is based on this answer: https://stackoverflow.com/a/10688235/2320453 Any idea why it doesn't work? Is this the "right way" to do something like this?
EDIT: To sum up: I want to search within the a tags and get their text and attributes. I don't want a complete list; instead, I want to be able to search with a variable. Pseudo-code:
import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")
searchterm = 'one'
test=html.xpath("...a/text()=searchterm")
print test
Expected result
One #link1
-
duenni about 11 years
Thanks! You're right, my first example prints True. Your first example prints Element at 0xc99b90. How can I get it to print One and replace the One in /a[text()='One'] with a variable? I also edited the first post; I had messed up some brackets in the first place.
-
mata about 11 years
text() selects a text node, so .../a/text() would return a list of the text contents of all anchors, if that's what you need. Alternatively, you can use the returned element to access its attributes from Python.
-
duenni about 11 years
So it's better to retrieve a list with all items and then search within that list from Python, instead of narrowing down the XPath expression to return only the one item I'm searching for?
-
duenni about 11 years
Edited my first post to clarify.
-
mata about 11 years
If you use something like ".../a[text()=%r]" % searchterm you get a list of all matching nodes. If you add /@href you get the href contents, and if you add /text() you get the text content (which would be pretty much pointless, as it's the term you're searching for), always as a list. What's best to use depends on your concrete use case.
-
sebdelsol over 9 years
my_div = x.xpath(".//div[@id='%s']"%div_name)[0]
works fine too
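
As a follow-up to the question in the comments about fetching everything and searching from Python: one possible sketch, assuming a trimmed, closed-off copy of the question's markup (`SNIPPET`) and two made-up names (`toc`, `lookup`), is to collect all level-2 anchors once and look the term up case-insensitively, so that "one" finds "One". It is not taken from either answer, just an illustration of that alternative.

```python
import lxml.html

# Stand-in markup modelled on the question's snippet (assumption, not the real page).
SNIPPET = """
<ul class="toc">
  <li class="level2"><div class="li"><a href="#link1">One</a></div></li>
  <li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
  <li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
</ul>
"""

doc = lxml.html.fromstring(SNIPPET)

# Collect every level-2 anchor once, keyed by its lower-cased text.
toc = {
    a.text.strip().lower(): (a.text, a.get("href"))
    for a in doc.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a")
    if a.text  # skip anchors without direct text
}


def lookup(searchterm):
    """Return (text, href) for the anchor matching searchterm, or None."""
    return toc.get(searchterm.strip().lower())


print(lookup("one"))   # ('One', '#link1')
print(lookup("four"))  # None
```

Whether this beats narrowing the XPath depends on the use case: it runs one query and allows several lookups, but it keeps every entry in memory.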