XPath to locate a cell with specific text parsing HTML tables

Solution 1

Use this XPath:

//td[contains(., 'Chapter')]
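
To see why this works where contains(text(), ...) from the question does not, here is a minimal sketch. It assumes the JDK's built-in javax.xml.xpath engine (XPath 1.0) rather than HtmlUnit, and a well-formed stand-in for the sample row; the class name ContainsDotDemo and the count helper are made up for illustration. In the sample row the text "Chapter 1" sits inside font/a, so the td has no text-node child of its own that contains it, while "." (the cell's whole string value) does.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ContainsDotDemo {
    // Assumption: a well-formed copy of the row from the question
    // (the &nbsp; cell is left empty so the snippet parses as XML).
    static final String ROW =
        "<table><tr valign=\"top\">"
      + "<td nowrap=\"\" align=\"Right\"><font face=\"Verdana\">"
      + "<a href=\"index.cfm?a=1\">Chapter 1</a></font></td>"
      + "<td class=\"ChapterT\"><font face=\"Verdana\">DEFINITIONS</font></td>"
      + "<td/></tr></table>";

    // Hypothetical helper: number of nodes the expression selects in ROW.
    static int count(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(ROW)), XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // text() only looks at the td's own text-node children: no match.
        System.out.println(count("//td[contains(text(), 'Chapter')]")); // prints 0
        // "." is the td's full string value, including nested elements.
        System.out.println(count("//td[contains(., 'Chapter')]"));      // prints 1
    }
}
```

Note that the second cell is not matched even though its class is "ChapterT": attribute values are not part of an element's string value.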

Solution 2

You want all td elements under your current node, not all td elements in the document, which is what the currently accepted answer selects.

Use:

.//td[.//text()[contains(., 'Chapter')]]

This selects all td descendants of the current node that have at least one descendant text node whose string value contains the string "Chapter".

If it is known in advance that any td under this table only has a single text node, this can be simplified to just:

.//td[contains(., 'Chapter')]
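
The difference between the relative and the absolute form can be checked with a small sketch. It assumes the JDK's javax.xml.xpath engine instead of HtmlUnit's getByXPath, and an invented two-table page; RelativePathDemo, PAGE and countFromSecondTable are hypothetical names. Both expressions are evaluated with the second table as the context node, as in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RelativePathDemo {
    // Assumption: a made-up page with a matching cell in each of two tables.
    static final String PAGE =
        "<html><body>"
      + "<table><tr><td>Chapter A</td></tr></table>"
      + "<table><tr><td><a>Chapter 1</a></td><td>DEFINITIONS</td></tr></table>"
      + "</body></html>";

    // Evaluates expr with the second table as the context node.
    static int countFromSecondTable(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node table = (Node) xpath.evaluate("//table[2]", doc, XPathConstants.NODE);
            NodeList hits = (NodeList) xpath.evaluate(expr, table, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "//" restarts at the document root, so the first table's cell is included.
        System.out.println(countFromSecondTable("//td[contains(., 'Chapter')]"));  // prints 2
        // ".//" stays below the context node (the second table).
        System.out.println(countFromSecondTable(".//td[contains(., 'Chapter')]")); // prints 1
        System.out.println(
            countFromSecondTable(".//td[.//text()[contains(., 'Chapter')]]"));     // prints 1
    }
}
```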

Solution 3

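You're on the right "path".
The contains() function tests the string value of a specific node; contains(text(), ...) looks only at the td's own text, not at text in any of its children. Try this XPath, which you could read as follows: get every tr/td with a sub-element that contains the text 'Chapter'. Note that in XPath 1.0 contains() converts the node-set * to a string using only its first node, so only the first child element of each td is tested.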

tr/td[contains(*,"Chapter")]
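
A quick way to check how contains(*, ...) behaves is the sketch below. It assumes the JDK's javax.xml.xpath engine (XPath 1.0) and two invented rows; ContainsStarDemo and its count helper are hypothetical names. It matches a row shaped like the one in the question, but misses a cell whose first child element does not hold the text; the predicate *[contains(., 'Chapter')] is one way to test every child element.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ContainsStarDemo {
    // Assumption: shaped like the question's row (text nested in the first child).
    static final String NESTED =
        "<table><tr><td><font><a>Chapter 1</a></font></td>"
      + "<td><font>DEFINITIONS</font></td></tr></table>";

    // Assumption: the matching text is in the SECOND child element of the cell.
    static final String SECOND_CHILD =
        "<table><tr><td><b>Part I</b><a>Chapter 1</a></td></tr></table>";

    // Hypothetical helper: number of nodes the expression selects in xml.
    static int count(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The first child (font) has string value "Chapter 1": one match.
        System.out.println(count(NESTED, "//tr/td[contains(*, 'Chapter')]"));         // prints 1
        // Only the first child (b, "Part I") is converted to a string: no match.
        System.out.println(count(SECOND_CHILD, "//tr/td[contains(*, 'Chapter')]"));   // prints 0
        // Testing each child element separately finds the cell.
        System.out.println(count(SECOND_CHILD, "//tr/td[*[contains(., 'Chapter')]]")); // prints 1
    }
}
```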

Good luck

Author by

David Brown

Expert Web Developer Founder Tucanoo Solutions Ltd : https://www.tucanoo.com Grails Development Specialists

Updated on February 26, 2021

Comments

  • David Brown
    David Brown about 3 years

    Hope someone out there can quickly point me in the right direction with my XPath difficulties.

    Currently I've got to the point where I'm identifying the correct table I need in my HTML source, but then I need to process only the rows that have the text 'Chapter' somewhere in the DOM.

    My last attempt was to do this :

    // get the correct table
    HtmlTable table = page.getFirstByXPath("//table[2]");
    
    // now the failing bit....
    def rows = table.getByXPath("*/td[contains(text(),'Chapter')]") 
    

    I thought the XPath above would mean: get me all elements that have a child 'td' element that contains the text 'Chapter' somewhere in its DOM.

    An example of a matching row from my source is :

    <tr valign="top">
      <td nowrap="" align="Right">
       <font face="Verdana">
       <a href="index.cfm?a=1">Chapter 1</a>
       </font>
      </td>
      <td class="ChapterT">
        <font face="Verdana">DEFINITIONS</font>
      </td>
      <td>&nbsp;</td>
    </tr>
    

    Any help / pointers greatly appreciated.

    Thanks,

  • David Brown
    David Brown about 12 years
    Hi William, gave it a go but couldn't get it to return anything. What has worked, although it doesn't seem the most efficient, is a one-liner: ' def chapterAnchors = page.anchors.findAll {HtmlAnchor a -> a.asText().contains('Chapter')} '
  • David Brown
    David Brown about 12 years
    Thanks, that appears to work. What does the '.' represent? Also, I don't understand why the 'relative' detection isn't working, e.g. you have the // which, as I understand it, means begin at the root?
  • Kirill Polishchuk
    Kirill Polishchuk about 12 years
    @Dave, you're welcome. '.' and '//' are XPath abbreviated syntax. '.' selects the context node. //td selects all the td descendants of the document root and thus selects all td elements in the same document as the context node. Reference: w3.org/TR/xpath/#path-abbrev