XPath expression to remove whitespace

Solution 1

I. Use this single XPath expression:

translate(normalize-space(/tr/td/a), ' ', '')

Explanation:

  1. normalize-space() produces a new string from its argument, in which any leading or trailing white-space (space, tab, NL or CR characters) is deleted and any intermediary white-space is replaced by a single space character.

  2. translate() takes the result produced by normalize-space() and produces a new string in which each of the remaining intermediary spaces is replaced by the empty string.
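The two steps can be mimicked outside XPath to see what each one contributes. The following is a minimal Python sketch, not how an XPath engine implements these functions; the helper names normalize_space and translate are hypothetical stand-ins written for illustration, and the raw string is assumed to resemble the padded link text from the question.

```python
import re


def normalize_space(text: str) -> str:
    """Mimic XPath normalize-space(): collapse runs of space, tab, LF
    and CR into one space, then drop leading and trailing spaces."""
    collapsed = re.sub(r"[ \t\n\r]+", " ", text)
    return collapsed.strip(" ")


def translate(text: str, from_chars: str, to_chars: str) -> str:
    """Mimic XPath translate(): map each character of from_chars to the
    character at the same position in to_chars, or delete it when
    to_chars is too short. The first occurrence in from_chars wins."""
    table = {}
    for index, char in enumerate(from_chars):
        if ord(char) in table:
            continue
        table[ord(char)] = to_chars[index] if index < len(to_chars) else None
    return text.translate(table)


# Assumed raw string value of the <a> element, padding included.
raw = "\n\n            16 : 00\n\n     "

step1 = normalize_space(raw)       # padding removed, inner spaces kept
step2 = translate(step1, " ", "")  # remaining spaces deleted

print(repr(step1))
print(repr(step2))
```

Note that the combined expression deletes every space, so it yields 16:00 rather than 16 : 00; normalize-space() on its own keeps the single spaces around the colon.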


II. Alternatively:

translate(/tr/td/a, ' &#9;&#10;&#13;', '')

Solution 2

Try the following XPath expression:

//td[@class='score-time status']/a[normalize-space() = '16 : 00']

Solution 3

You can use XPath's normalize-space() as in //a[normalize-space()="16 : 00"]
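Solutions 2 and 3 select the a element by its normalized text; they do not return a trimmed string by themselves. A small sketch of the difference, assuming the third-party lxml library is installed and using a hand-made, well-formed copy of the question's markup:

```python
from lxml import etree

# Well-formed copy of the question's markup (the class attribute's
# closing quote is added here so that an XML parser accepts it).
SNIPPET = """<tr class="even  expanded first">
  <td class="score-time status">
    <a href="/matches/2012/08/02/europe/uefa-cup/">

            16 : 00

    </a>
  </td>
</tr>"""

root = etree.fromstring(SNIPPET)

# An exact comparison fails: the string value still carries the padding.
exact = root.xpath('//a[. = "16 : 00"]')

# Comparing the normalized string value matches the element.
normalized = root.xpath('//a[normalize-space() = "16 : 00"]')

# The match above is an element; wrapping the path in normalize-space()
# asks the engine for the cleaned string instead.
clean = root.xpath('normalize-space(//td[@class="score-time status"]/a)')

print(len(exact), len(normalized), repr(clean))
```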

Solution 4

I came across this thread while dealing with a similar issue of my own.

HTML

<div class="d-flex">
<h4 class="flex-auto min-width-0 pr-2 pb-1 commit-title">
  <a href="/nsomar/OAStackView/releases/tag/1.0.1">

    1.0.1
  </a>

XPath start command

tree.xpath('//div[@class="d-flex"]/h4/a/text()')

However this grabbed random whitespace and gave me the output of:

['\n          ', '\n        1.0.1\n      ']

Adding a normalize-space() predicate filtered out the whitespace-only text node and left me with just the node I wanted:

tree.xpath('//div[@class="d-flex"]/h4/a/text()[normalize-space()]')

['\n        1.0.1\n      ']

I could then grab the first element of the list, and use strip() to remove any further whitespace

XPath final command

tree.xpath('//div[@class="d-flex"]/h4/a/text()[normalize-space()]')[0].strip()

Which left me with exactly what I required:

1.0.1
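When only the string is needed, normalize-space() can also wrap the whole path, so the engine returns a trimmed string and no indexing or strip() is required. A hedged sketch, assuming lxml is installed and using a copy of the markup above with the closing tags added by hand:

```python
from lxml import etree

# Copy of the snippet with </h4> and </div> added so that it parses as XML.
HTML = """<div class="d-flex">
<h4 class="flex-auto min-width-0 pr-2 pb-1 commit-title">
  <a href="/nsomar/OAStackView/releases/tag/1.0.1">

    1.0.1
  </a>
</h4>
</div>"""

tree = etree.fromstring(HTML)

# text()[normalize-space()] keeps only text nodes that contain something
# other than whitespace; the surviving node still has its padding.
nodes = tree.xpath('//div[@class="d-flex"]/h4/a/text()[normalize-space()]')

# Wrapping the path in normalize-space() yields a string, not a list.
version = tree.xpath('normalize-space(//div[@class="d-flex"]/h4/a)')

print(repr(version))
```

If the path matches several elements, normalize-space() uses only the first one in document order, so this form suits paths that identify a single node.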

Solution 5

  • You can check whether text() nodes are empty.

    /path/text()[not(.='')]

    Note that a whitespace-only text node is not equal to '', so it passes this test; use /path/text()[normalize-space()] to skip such nodes as well.

This can be useful with axes such as following-sibling:: when the text is not inside a container element, or with child::.

  • You can use string(), or the regular-expression functions of XPath 2.0: matches(), replace() and tokenize().

NOTE: Some comments say that XPath cannot do string manipulation. Even though it is not really designed for that, you can do basic things with contains(), starts-with() and, in XPath 2.0, replace().

If you want to check for whitespace nodes it is much harder, because you will generally have a node list as the result set, and most XPath functions, such as matches() or replace(), operate on only one node.

  • You can separate node selection from string manipulation.

That is, you can use XPath to retrieve a container or a list of text nodes, and then process the result in another language (Java, PHP, Python or Perl, for instance).
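A sketch of that split, using only Python's standard library: ElementTree locates the element and ordinary string methods do the cleanup. The markup is an assumed well-formed copy of the question's snippet, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed copy of the question's markup.
SNIPPET = """<tr class="even  expanded first">
  <td class="score-time status">
    <a href="/matches/2012/08/02/europe/uefa-cup/">

            16 : 00

    </a>
  </td>
</tr>"""

row = ET.fromstring(SNIPPET)

# Step 1: node selection. ElementTree supports only a small XPath subset,
# which is enough to reach the link from the <tr> root.
link = row.find("./td/a")

# Step 2: string manipulation in Python. split() with no arguments splits
# on any run of whitespace, so joining with one space collapses and trims.
cleaned = " ".join(link.text.split())

print(cleaned)
```

Keeping the cleanup in the host language avoids depending on which XPath version the engine supports.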

Author by

adellam

Updated on July 08, 2022

Comments

  • adellam
    adellam almost 2 years

    I have this HTML:

     <tr class="even  expanded first">
       <td class="score-time status">
         <a href="/matches/2012/08/02/europe/uefa-cup/">
    
                16 : 00
    
         </a>
        </td>        
      </tr>
    

    I want to extract the (16 : 00) string without the extra whitespace. Is this possible?

  • Arup Rakshit
    Arup Rakshit almost 10 years
    Is there a shortest XPath expression to get only the CDATA nodes through an XML file?
  • Dimitre Novatchev
    Dimitre Novatchev almost 10 years
    @ArupRakshit, There are no "CDATA nodes" in the XPath Data Model and thus it is not possible to distinguish CDATA as part of the text node that contains it. The same way as it is not possible to know if the short tag was used for an element without children, or if quotes or apostrophes were used as delimiters around an attribute value.
  • Arup Rakshit
    Arup Rakshit almost 10 years
    @DimitreNovatchev Thanks for the reply. So it means I need to find it the way I search for regular nodes.
  • Dimitre Novatchev
    Dimitre Novatchev almost 10 years
    @ArupRakshit, Yes, one can only select whole text nodes in XPath. You could filter these nodes with predicate(s) if you know something more (like a substring) about the text you are looking for.