How to use regular expression in lxml xpath?

21,947

Solution 1

You can do this (although you don't need regular expressions for the example). Lxml supports regular expressions from the EXSLT extension functions. (see the lxml docs for the XPath class, but it also works for the xpath() method)

doc.xpath("//a[re:match(text(), 'some text')]", 
        namespaces={"re": "http://exslt.org/regular-expressions"})

Note that you need to give the namespace mapping, so that it knows what the "re" prefix in the xpath expression stands for.

Solution 2

You can use the starts-with() function:

doc.xpath("//a[starts-with(text(),'some text')]")

Solution 3

why don't you just use xpath method starts-with here. you can use this to select specific elements that has text starting with your word something like

doc.xpath("//a[starts-with(text(),'some text')]")

note that if you want to select this element as well

<a href="www.example.com">ends with some text2</a>

its text is not starting with some text but it can be included as well using contains method something like

doc.xpath("//a[contains(text(),'some text')]")

Solution 4

Because I can't stand lxml's approach to namespaces, I wrote a little method that you cam bind to the HtmlElement class.

Just import HtmlElement:

from lxml.etree import HtmlElement

Then put this in your file:

# Patch the HtmlElement class to add a function that can handle regular
# expressions within XPath queries.
def re_xpath(self, path):
    return self.xpath(path, namespaces={
        're': 'http://exslt.org/regular-expressions'})
HtmlElement.re_xpath = re_xpath

And then when you want to make a regular expression query, just do:

my_node.re_xpath("//a[re:match(text(), 'some text')]")

And you're off to the races. With a little more work, you could probably modify this to replace the xpath method itself, but I haven't bothered since this is working well enough.

Share:
21,947

Related videos on Youtube

Arty
Author by

Arty

Updated on July 09, 2022

Comments

  • Arty
    Arty almost 2 years

    I'm using construction like this:

    doc = parse(url).getroot()
    links = doc.xpath("//a[text()='some text']")
    

    But I need to select all links which have text beginning with "some text", so I'm wondering is there any way to use regexp here? Didn't find anything in lxml documentation

  • lajarre
    lajarre over 11 years
    Not working for me, I do: match(., 'some text'). By the way I don't quite understand the . part. And func test has the same result (I think it makes more sense to use test actually :P)
  • Ciprian Tomoiagă
    Ciprian Tomoiagă over 7 years
    see this if you are tired of passing the namespaces
  • BSalita
    BSalita almost 4 years
    Import is now from lxml.html import HtmlElement
  • BSalita
    BSalita almost 4 years
    More helpful example with regex searching for <div id=blah_1> tree.re_xpath("//div[re:match(@id, 'blah_\d+')]")
  • Priv Acyplease
    Priv Acyplease over 3 years
    Can I use something like this to override my_node.find()? I want to insert {*}
  • mlissner
    mlissner over 3 years
    I don't know why not, @PrivAcyplease.