lxml.etree, element.text doesn't return the entire text from an element
Solution 1
Use element.xpath("string()") or lxml.etree.tostring(element, method="text"); see the documentation.
Solution 2
As a public service to people who may be as lazy as I am, here is the code from the other answers in a form you can run.
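from lxml import etree

def get_text1(node):
    # Text directly inside node: its own text plus the tail of each child.
    result = node.text or ""
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result

def get_text2(node):
    # All text under node, recursively, including node's own tail.
    return ((node.text or '') +
            ''.join(map(get_text2, node)) +
            (node.tail or ''))

def get_text3(node):
    # node's text followed by the serialized markup of each child (tails included).
    # encoding="unicode" makes tostring return str rather than bytes.
    return (node.text or "") + "".join(
        [etree.tostring(child, encoding="unicode") for child in node.iterchildren()])

root = etree.fromstring(u"<td> text1 <a> link </a> text2 </td>")

print(root.xpath("text()"))
print(get_text1(root))
print(get_text2(root))
print(root.xpath("string()"))
print(etree.tostring(root, method="text", encoding="unicode"))
print(etree.tostring(root, method="xml", encoding="unicode"))
print(get_text3(root))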
Output is:
snowy:rpg$ python test.py
[' text1 ', ' text2 ']
text1 text2
text1 link text2
text1 link text2
text1 link text2
<td> text1 <a> link </a> text2 </td>
text1 <a> link </a> text2
Solution 3
It looks like an lxml bug to me, but according to the documentation it is by design. I've solved it like this:
def node_text(node):
    # Start with the text before the first child, if any.
    if node.text:
        result = node.text
    else:
        result = ''
    # Append the text that follows each child element.
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result
Solution 4
Another thing that seems to be working well to get the text out of an element is "".join(element.itertext())
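A minimal sketch of the itertext() approach, using the sample markup from the question. Note that itertext() also yields the text inside child elements, so it behaves like xpath("string()") rather than the tails-only variants. The helper name all_text is hypothetical, and the import falls back to the standard library's xml.etree.ElementTree (which shares the same text/tail model) when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # stdlib fallback with the same text/tail model
    import xml.etree.ElementTree as etree


def all_text(element):
    """Concatenate every text node under element, in document order."""
    return "".join(element.itertext())


root = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# repr() makes the leading, trailing, and doubled spaces visible.
print(repr(all_text(root)))

# Collapse runs of whitespace if only the words matter.
print(" ".join(all_text(root).split()))
```

If the anchor text should be excluded, this is not the right tool; collect element.text and the children's tails instead.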
Solution 5
<td> text1 <a> link </a> text2 </td>
Here's how it is (ignoring whitespace):
td.text == 'text1'
a.text == 'link'
a.tail == 'text2'
If you don't want the text that is inside child elements, you can collect only their tails (text and tail are None when empty, hence the "or ''"):
text = (td.text or '') + ''.join([el.tail or '' for el in td])
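The text/tail split above can be checked with a short sketch that keeps the whitespace instead of ignoring it. The helper name direct_text and the second sample document are assumptions made for illustration; the import falls back to the standard library's xml.etree.ElementTree, which uses the same text/tail model, when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # stdlib fallback with the same text/tail model
    import xml.etree.ElementTree as etree


def direct_text(element):
    """Text owned by element itself: its .text plus each child's .tail."""
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts)


td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")
a = td[0]

# The three pieces of text and where each one is stored.
print(repr(td.text), repr(a.text), repr(a.tail))

# Only the text outside the <a> element.
print(repr(direct_text(td)))

# Hypothetical edge case: no text and no tail, so both attributes are None.
empty = etree.fromstring("<td><br/></td>")
print(repr(direct_text(empty)))
```

Guarding with "or ''" matters because concatenating None with a string raises a TypeError.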
user522034
Updated on May 17, 2022
Comments
-
user522034 about 2 years
I scraped some HTML via XPath, which I then converted into an etree. Something similar to this:
<td> text1 <a> link </a> text2 </td>
But when I call element.text, I only get text1. (The rest must be there: when I check my query in Firebug, the text of the element is highlighted, both the text before and after the embedded anchor element...)
-
user522034 over 13 years
tostring(element, method="text") almost works, but it also returns the text of the embedded anchor element, which I don't want.
-
user522034 over 13 years
element.text + child.tail works, but I wish element.text worked the way I want it to :)
-
user522034 over 13 years
element.xpath("string()") returns the same result as *.tostring(). I tried xpath("text()"), which doesn't return the text of the anchor element, but it returns a list of 2 strings. Thanks for pointing out some stuff though.
-
mmj almost 8 years
It's not a bug, actually it's the feature that allows you to interpose text among subelements when building an XML element: stackoverflow.com/q/38520331/694360
-
Jaap Versteegh almost 8 years
Thanks for pointing that out. I guess that is useful, but imho it would be a lot clearer if .text would just return the full text and some other suitably named property would contain only the part up to the first subelement. How about node.head. This also gives a clue that what you'll want next is child.tail without having to stackoverflow first.
-
Robert Williams almost 7 years
Only pasting code is not enough. You should also explain why it works :)