lxml.html parsing with XPath and variables

python html parsing web-scraping lxml

24,587

Solution 1

Your first example works, but probably not the way you think it should:

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()='One'")

What this returns is a boolean, which is true if the condition ...='One' holds for any of the nodes in the node-set on the left side of the XPath expression. That's why you get the error in your second example: True[0] is not valid.

You probably want all nodes matching the expression that have 'One' as their text. The corresponding expression would be:

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']")

This returns a node-set (in lxml, a list of elements) as the result. If you just need the URL as a string:

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']/@href")
# returns: ['#link1']
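
As a rough illustration of the three result types described above, here is a minimal sketch that runs the expressions against an inline copy of the question's snippet. The SNIPPET string (with closing tags added) and the variable names are assumptions made for this example; lxml.html.fromstring is used so that nothing has to be fetched from a URL.

```python
import lxml.html

# Inline copy of the table of contents from the question.
# The closing tags were added for this example.
SNIPPET = """
<div id="dw__toc">
<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
</ul>
</li>
</ul>
</div>
"""

doc = lxml.html.fromstring(SNIPPET)
base = "//ul[@class='toc']/li[@class='level2']/div[@class='li']/a"

# Comparison at the end of the path: the whole expression is a boolean.
is_present = doc.xpath(base + "/text()='One'")

# Comparison inside a predicate: the result is a list of <a> elements.
anchors = doc.xpath(base + "[text()='One']")

# Selecting the attribute: the result is a list of strings.
hrefs = doc.xpath(base + "[text()='One']/@href")

print(is_present)
print([a.text for a in anchors])
print(hrefs)
```

Indexing with [0] only makes sense for the second and third forms, since those return lists.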

Solution 2

I tried mata's response, but it didn't work for me:

div_name = 'foo'
my_div = x.xpath(".//div[@id=%s]" %div_name)[0]

I found this on the lxml website, http://lxml.de/xpathxslt.html#the-xpath-method, for those who might have the same problem:

div_name = 'foo'
my_div = x.xpath(".//div[@id=$name]", name=div_name)[0]
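
A likely reason the first attempt fails: without quotes, the formatted expression becomes .//div[@id=foo], and XPath reads foo as the name of a child element rather than as a string, so nothing matches and [0] raises an IndexError. The sketch below compares the three forms and then applies variable binding to the question's table of contents. The sample markup, the find_link helper, and the use of translate() for case-insensitive matching are assumptions made for illustration, not part of the original answers.

```python
import lxml.html

page = lxml.html.fromstring(
    '<div><div id="foo">first</div><div id="bar">second</div></div>'
)
div_name = "foo"

# Unquoted: XPath compares @id with a child element named <foo>.
unquoted = page.xpath(".//div[@id=%s]" % div_name)

# Quoted by hand: works as long as the value contains no quote characters.
quoted = page.xpath(".//div[@id='%s']" % div_name)

# XPath variable: lxml binds the keyword argument to $name, no quoting needed.
bound = page.xpath(".//div[@id=$name]", name=div_name)

toc = lxml.html.fromstring(
    '<ul class="toc">'
    '<li class="level2"><div class="li"><a href="#link1">One</a></div></li>'
    '<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>'
    '<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>'
    '</ul>'
)

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"


def find_link(doc, term):
    """Hypothetical helper: return (text, href) of the first matching link."""
    matches = doc.xpath(
        "//ul[@class='toc']/li[@class='level2']/div[@class='li']"
        "/a[translate(text(), $upper, $lower) = $term]",
        upper=UPPER, lower=LOWER, term=term.lower(),
    )
    if not matches:
        return None
    link = matches[0]
    return link.text, link.get("href")


print(unquoted, quoted[0].text, bound[0].text)
print(find_link(toc, "one"))
```

XPath 1.0 string comparison is case-sensitive, which is why the helper lower-cases both sides (ASCII letters only) before comparing.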


Author by

duenni

Updated on February 06, 2020

Comments

duenni over 4 years

I have this HTML snippet

<div id="dw__toc">
<h3 class="toggle">Table of Contents</h3>
<div>

<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>

Now I want to parse it with lxml.html. In the end I want a function that takes a search term (e.g. "one") and returns

One
#link1

For now I'm trying to get a variable in the XPath.

Works:

import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()='One'")

print test

Trying with variable. I want to replace the hardcoded 'One' with a variable which I can return to the function later.

Doesn't work:

import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")

desiredvars = ['One']
myresultset=((var, html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='%s']"%(var))[0]) for var in desiredvars)

for each in myresultset: 
        print each

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <genexpr>
IndexError: list index out of range

This is based on this answer: https://stackoverflow.com/a/10688235/2320453 Any idea why it doesn't work? Is this the "right way" to do something like this?

EDIT: To sum things up: I want to search within the a tags and get the text and attributes of the matching tag. I don't want a complete list; instead, I want to be able to search with a variable. Pseudo-code:

import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")

searchterm = 'one'

test=html.xpath("...a/text()=searchterm")

print test

Expected result

One
#link1

duenni about 11 years

Thanks! You're right, my first example prints True. Your first example prints Element at 0xc99b90. How can I bring it to print One and replace the One in /a[text()='One'] with a variable? I also edited the first post, messed up some brackets in the first place....
mata about 11 years

text() selects a text node, so .../a/text() would return a list of the text contents of all anchors, if that's what you need; or you can use the returned element to access its attributes from Python.
duenni about 11 years

So it's better to retrieve a list with all items and then search within that list from python instead of narrowing down the Xpath-expression to only return the one item I'm searching for?
duenni about 11 years

Edited my first post to clarify.
mata about 11 years

If you use something like ".../a[text()=%r]" % searchterm you get a list of all matching nodes. If you add /@href you get the href contents, and if you add /text() you get the text content (which would be pretty much pointless, as it's the term you're searching for). The result is always a list. What's best to use depends on your concrete use case.
sebdelsol over 9 years

my_div = x.xpath(".//div[@id='%s']"%div_name)[0] works fine too