Get text content of an HTML element using XPath?
You want to select all descendant text, not just child text:
//div[a[contains(., "Add to cart")]]/p//text()
Note the double slash between p and text() there.
This will potentially also include a lot of inter-tag whitespace, though, so you'll need to clean that up. Example using lxml:
>>> import lxml.etree as ET
>>> tree = ET.fromstring('''<div>
... <div>
... <p>
... <span class="abc">Monitor</span> <b>$300</b>
... </p>
... <a href="/add">Add to cart</a>
... </div>
... <div>
... <p>
... <span class="abc">Keyboard</span> $20
... </p>
... <a href="/add">Add to cart</a>
... </div>
... </div>''')
>>> tree.xpath('//div[a[contains(., "Add to cart")]]/p//text()')
['\n ', 'Monitor', ' ', '$300', '\n ', '\n ', 'Keyboard', ' $20 \n ']
>>> res = _
>>> [txt for txt in (txt.strip() for txt in res) if txt]
['Monitor', '$300', 'Keyboard', '$20']
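If you would rather get one string per paragraph (Monitor $300 and Keyboard $20, as asked in the question), one option is to select the p elements first and then evaluate the XPath 1.0 function normalize-space(.) on each of them. The following is only a sketch under that assumption: the helper name paragraph_texts and the HTML constant are made up for illustration, and the indentation in the sample markup is arbitrary.

```python
import lxml.etree as ET

# Same sample markup as in the session above (indentation is arbitrary).
HTML = """<div>
  <div>
    <p>
      <span class="abc">Monitor</span> <b>$300</b>
    </p>
    <a href="/add">Add to cart</a>
  </div>
  <div>
    <p>
      <span class="abc">Keyboard</span> $20
    </p>
    <a href="/add">Add to cart</a>
  </div>
</div>"""


def paragraph_texts(markup):
    """Return one whitespace-normalized string per matching <p> (hypothetical helper)."""
    tree = ET.fromstring(markup)
    # Select the <p> elements themselves rather than their text nodes.
    paragraphs = tree.xpath('//div[a[contains(., "Add to cart")]]/p')
    # normalize-space(.) takes the element's full string value (all descendant
    # text), trims it, and collapses internal whitespace runs to single spaces.
    return [p.xpath('normalize-space(.)') for p in paragraphs]


print(paragraph_texts(HTML))
```

This trades the flat list of text nodes for one value per p, so the name and the price stay together without any manual stripping.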
Comments
Genghis Khan almost 4 years
See this HTML:
<div> <p> <span class="abc">Monitor</span> <b>$300</b> </p> <a href="/add">Add to cart</a> </div> <div> <p> <span class="abc">Keyboard</span> $20 </p> <a href="/add">Add to cart</a> </div>
Using XPath, I want to parse Monitor $300 and Keyboard $20. I use this XPath:
//div[a[contains(., "Add to cart")]]/p/text()
But it selects <span class="abc">Monitor</span> <b>$300</b>. I don't want the tags. How do I get only the text?