lxml.etree, element.text doesn't return the entire text from an element

python xml lxml elementtree xml.etree

12,770

Solution 1

Use element.xpath("string()") or lxml.etree.tostring(element, method="text") - see the documentation.
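A minimal sketch of both calls, assuming lxml is installed. The surrounding `<tr>` and the " after " tail text are made up for illustration; they show that `tostring()` also appends the element's own tail unless `with_tail=False` is passed. The last line joins the `text()` results, which is one way to get only the text directly inside the element without the anchor's text.

```python
from lxml import etree

# Hypothetical wrapper row so that <td> has tail text of its own.
row = etree.fromstring("<tr><td> text1 <a> link </a> text2 </td> after </tr>")
td = row[0]

# XPath string value: all descendant text, without the element's tail.
via_xpath = td.xpath("string()")

# tostring() includes the tail by default; with_tail=False drops it.
# encoding="unicode" returns str rather than bytes.
with_tail = etree.tostring(td, method="text", encoding="unicode")
without_tail = etree.tostring(td, method="text", encoding="unicode",
                              with_tail=False)

# text() selects only the text nodes that are direct children of <td>.
direct_only = "".join(td.xpath("text()"))

print(repr(via_xpath))
print(repr(with_tail))
print(repr(without_tail))
print(repr(direct_only))
```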

Solution 2

As a public service to people out there who may be as lazy as I am, here's the code from the other answers in a form you can run.

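from lxml import etree

def get_text1(node):
    # Direct text only: node.text plus the tail of each child.
    result = node.text or ""
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result

def get_text2(node):
    # All text, recursively: own text, children's text, then own tail.
    return ((node.text or '') +
            ''.join(map(get_text2, node)) +
            (node.tail or ''))

def get_text3(node):
    # Own text plus each child serialized as markup (tails included).
    # encoding="unicode" makes tostring() return str instead of bytes.
    return (node.text or "") + "".join(
        [etree.tostring(child, encoding="unicode")
         for child in node.iterchildren()])


root = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

print(root.xpath("text()"))
print(get_text1(root))
print(get_text2(root))
print(root.xpath("string()"))
print(etree.tostring(root, method="text", encoding="unicode"))
print(etree.tostring(root, method="xml", encoding="unicode"))
print(get_text3(root))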

Output is:

snowy:rpg$ python test.py 
[' text1 ', ' text2 ']
 text1  text2 
 text1  link  text2 
 text1  link  text2 
 text1  link  text2 
<td> text1 <a> link </a> text2 </td>
 text1 <a> link </a> text2

Solution 3

This looks like an lxml bug to me, but it is by design according to the documentation. I've solved it like this:

def node_text(node):
    if node.text:
        result = node.text
    else:
        result = ''
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result

Solution 4

Another thing that seems to be working well to get the text out of an element is "".join(element.itertext())
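A short sketch of what itertext() yields, using the standard library's xml.etree.ElementTree here as an assumption for portability; lxml elements expose the same itertext() method. Note that the result includes the text of child elements, so it does not help if the link text should be excluded. The `normalized` variable is an illustrative extra that collapses runs of whitespace.

```python
import xml.etree.ElementTree as ET

td = ET.fromstring("<td> text1 <a> link </a> text2 </td>")

# itertext() walks the subtree in document order and yields each text chunk.
pieces = list(td.itertext())
full_text = "".join(pieces)

# Optional: collapse whitespace runs and trim the ends.
normalized = " ".join(full_text.split())

print(pieces)
print(repr(full_text))
print(repr(normalized))
```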

Solution 5

<td> text1 <a> link </a> text2 </td>

Here's how it is (ignoring whitespace):

td.text == 'text1'
a.text == 'link'
a.tail == 'text2'

If you don't want the text that is inside child elements, you can collect only their tails. Both text and tail are None when there is no text, hence the or '' guards:

text = (td.text or '') + ''.join(el.tail or '' for el in td)
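To check the text/tail model with the whitespace left in, here is a small sketch. It uses the standard library's xml.etree.ElementTree as an assumption (lxml follows the same text/tail model), and `direct_text` is a hypothetical helper name, not part of either library.

```python
import xml.etree.ElementTree as ET

td = ET.fromstring("<td> text1 <a> link </a> text2 </td>")
a = td[0]

# Whitespace is preserved exactly as it appears in the markup.
print(repr(td.text))  # text before the first child
print(repr(a.text))   # text inside <a>
print(repr(a.tail))   # text after </a>, still inside <td>

# With no surrounding text, text and tail are None rather than "".
bare = ET.fromstring("<td><a>link</a></td>")
print(bare.text, bare[0].tail)

def direct_text(element):
    """Hypothetical helper: text directly inside element, skipping child text."""
    return (element.text or "") + "".join(child.tail or "" for child in element)

print(repr(direct_text(td)))
print(repr(direct_text(bare)))
```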



user522034

Updated on May 17, 2022

Comments

user522034 about 2 years
I scraped some HTML via XPath and then converted it into an etree. Something similar to this:
```
<td> text1 <a> link </a> text2 </td>
```
but when I call element.text, I only get text1. (The rest must be there: when I check my query in Firebug, the text of the element is highlighted, both before and after the embedded anchor element.)
user522034 over 13 years

tostring(element, method="text") almost works, but it also returns the text of the embedded anchor element, which I don't want.
user522034 over 13 years

element.text + child.tail works, but I wish element.text worked the way I want it to :)
user522034 over 13 years

element.xpath("string()") returns same result as *.tostring(). I tried xpath("text()") which doesn't return the text of the anchor element, but it returns a list of 2 strings. Thanks for pointing out some stuff though.
mmj almost 8 years

It's not a bug, actually it's the feature that allows you to interpose text among subelements when building an XML element: stackoverflow.com/q/38520331/694360
Jaap Versteegh almost 8 years

Thanks for pointing that out. I guess that is useful, but imho it would be a lot clearer if .text just returned the full text and some other suitably named property contained only the part up to the first subelement. How about node.head? This would also give a clue that what you'll want next is child.tail, without having to search Stack Overflow first.
Robert Williams almost 7 years

Only pasting code is not enough. You should also explain why it works :)