lxml.html parsing with XPath and variables

Solution 1

Your first example works, but probably not the way you think it should:

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()='One'")

What this returns is a boolean, which is true if the condition ...='One' holds for any of the nodes in the result set on the left side of the comparison. That's why you get the error in your second example: True[0] is not valid.
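A minimal sketch of the difference, assuming the table-of-contents markup from the question is parsed from a string rather than fetched from a URL. The names SNIPPET, doc and base are made up for this illustration:

```python
import lxml.html

# Assumption: a closed-off copy of the markup from the question.
SNIPPET = """
<div id="dw__toc">
<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
</ul>
</li>
</ul>
</div>
"""

doc = lxml.html.fromstring(SNIPPET)
base = "//ul[@class='toc']/li[@class='level2']/div[@class='li']/a"

# Comparison at the end of the path: the whole expression is a boolean.
as_bool = doc.xpath(base + "/text()='One'")

# Comparison inside a predicate: the expression selects the matching <a> elements.
as_nodes = doc.xpath(base + "[text()='One']")

print(type(as_bool).__name__, as_bool)
print(type(as_nodes).__name__, len(as_nodes))
```

Indexing the first result with [0] fails because a bool is not subscriptable, while the second result is an ordinary list.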

You probably want all nodes that match the expression and have 'One' as their text. The corresponding expression would be:

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']")

This returns a list of matching elements. If you only need the URL as a string, select the href attribute instead:

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']/@href")
# returns: ['#link1']
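If both the link text and the URL are needed, one option is to select the element and read its properties from Python. This is only a sketch: the markup is a shortened copy of the question's snippet, and the names doc, matches and pairs are illustrative.

```python
import lxml.html

# Assumption: a shortened version of the markup from the question.
doc = lxml.html.fromstring(
    '<ul class="toc">'
    '<li class="level2"><div class="li"><a href="#link1">One</a></div></li>'
    '<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>'
    '</ul>'
)

# Select the <a> elements themselves, not their text or attributes.
matches = doc.xpath(
    "//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']"
)

# Each match is an element: .text is the link text, .get() reads an attribute.
pairs = [(a.text, a.get("href")) for a in matches]

for text, href in pairs:
    print(text)
    print(href)
```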

Solution 2

I tried mata's response, but it didn't work for me:

div_name = 'foo'
my_div = x.xpath(".//div[@id=%s]" %div_name)[0]

I found this on the lxml website (http://lxml.de/xpathxslt.html#the-xpath-method), for those who might have the same problem:

div_name = 'foo'
my_div = x.xpath(".//div[@id=$name]", name=div_name)[0]
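The unquoted version most likely fails because the interpolated value ends up in the expression as a bare name (@id=foo), which XPath reads as a child element called foo and not as a string. Passing the value as an XPath variable avoids the quoting question entirely. Applied to the original question, a sketch could look like this; find_link is a hypothetical helper name and the markup is a shortened copy of the question's snippet:

```python
import lxml.html


def find_link(doc, term):
    """Return (text, href) of the first anchor whose text equals term, or None."""
    matches = doc.xpath(
        "//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()=$term]",
        term=term,  # bound to $term by lxml, no manual quoting needed
    )
    if not matches:
        return None
    first = matches[0]
    return first.text, first.get("href")


# Assumption: a shortened version of the markup from the question.
doc = lxml.html.fromstring(
    '<ul class="toc">'
    '<li class="level2"><div class="li"><a href="#link1">One</a></div></li>'
    '<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>'
    '</ul>'
)

print(find_link(doc, "One"))
print(find_link(doc, "one"))
```

The comparison is case-sensitive, so searching for "one" finds nothing here. Checking for an empty result before indexing also avoids the IndexError shown in the question.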
Author: duenni

Updated on February 06, 2020

Comments

  • duenni
    duenni over 4 years

    I have this HTML snippet

    <div id="dw__toc">
    <h3 class="toggle">Table of Contents</h3>
    <div>
    
    <ul class="toc">
    <li class="level1"><div class="li"><a href="#section">#</a></div>
    <ul class="toc">
    <li class="level2"><div class="li"><a href="#link1">One</a></div></li>
    <li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
    <li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
    

    Now I want to parse it with lxml.html. In the end I want a function that takes a search term (e.g. "one") and returns

    One
    #link1
    

    For now I'm trying to get a variable in the XPath.

    Works:

    import lxml.html
    html = lxml.html.parse("www.myurl.com/slash/something")
    
    test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()='One'")
    
    print test
    

    Trying with variable. I want to replace the hardcoded 'One' with a variable which I can return to the function later.

    Doesn't work:

    import lxml.html
    html = lxml.html.parse("www.myurl.com/slash/something")
    
    desiredvars = ['One']
    myresultset=((var, html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='%s']"%(var))[0]) for var in desiredvars)
    
    for each in myresultset: 
            print each
    
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "<stdin>", line 1, in <genexpr>
    IndexError: list index out of range
    

    This is based on this answer: https://stackoverflow.com/a/10688235/2320453 Any idea why it doesn't work? Is this the "right way" to do something like this?

    EDIT: To sum things up: I want to search within the a tags and get their text and href attribute. I don't want a complete list; instead I want to be able to search with a variable. Pseudo-code:

    import lxml.html
    html = lxml.html.parse("www.myurl.com/slash/something")
    
    searchterm = 'one'
    
    test=html.xpath("...a/text()=searchterm")
    
    print test
    

    Expected result

    One
    #link1
    
  • duenni
    duenni about 11 years
    Thanks! You're right, my first example prints True. Your first example prints Element at 0xc99b90. How can I get it to print One, and replace the One in /a[text()='One'] with a variable? I also edited the first post; I had messed up some brackets in the first place.
  • mata
    mata about 11 years
    text() selects a text node, so .../a/text() would return a list of the text contents of all anchors, if that's what you need. Alternatively, you can use the returned element to access its attributes from Python.
  • duenni
    duenni about 11 years
    So is it better to retrieve a list with all items and then search within that list from Python, instead of narrowing down the XPath expression to return only the one item I'm searching for?
  • duenni
    duenni about 11 years
    Edited my first post to clarify.
  • mata
    mata about 11 years
    If you use something like ".../a[text()=%r]" % searchterm you get a list of all matching nodes. If you add /@href you get the href contents, and if you add /text() you get the text content (which would be pretty much pointless, as it's the term you're searching for). The result is always a list. What's best to use depends on your concrete use case.
  • sebdelsol
    sebdelsol over 9 years
    my_div = x.xpath(".//div[@id='%s']"%div_name)[0] works fine too
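    A small sketch of where the quoted-interpolation approach can still break, assuming a search term that itself contains an apostrophe. The markup and the names doc, name, interpolation_ok and hrefs are invented for this illustration:

```python
import lxml.html
from lxml import etree

# Assumption: a link whose text contains a single quote.
doc = lxml.html.fromstring('<div><a href="#oneil">O\'Neil</a></div>')
name = "O'Neil"

# Interpolating into a single-quoted XPath literal produces
# //a[text()='O'Neil'], which is not a valid expression.
try:
    doc.xpath("//a[text()='%s']" % name)
    interpolation_ok = True
except etree.XPathError:
    interpolation_ok = False

# An XPath variable is passed as a value, so no quoting is involved.
hrefs = doc.xpath("//a[text()=$name]/@href", name=name)

print(interpolation_ok, hrefs)
```

    For simple values such as 'foo' both forms behave the same; the variable form is just the more robust choice when the term comes from user input.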