lxml.etree, element.text doesn't return the entire text from an element

Solution 1

Use element.xpath("string()") or lxml.etree.tostring(element, method="text") - see the documentation.
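As a rough illustration of the two calls named above, here is a minimal sketch. The sample markup and variable names are made up for the example. It also shows one detail that is easy to miss: tostring() includes the element's own tail unless with_tail=False is passed.

```python
from lxml import etree

# Hypothetical sample: the <td> from the question, wrapped in a <tr>
# so that the <td> has a tail (" trailing ").
root = etree.fromstring("<tr><td> text1 <a> link </a> text2 </td> trailing </tr>")
td = root[0]

# XPath string(): text of the element and all of its descendants.
via_xpath = td.xpath("string()")

# method="text" serializes only text content; encoding="unicode"
# makes tostring() return str instead of bytes.
via_tostring = etree.tostring(td, method="text", encoding="unicode", with_tail=False)

# Without with_tail=False, the text following </td> is included too.
with_tail = etree.tostring(td, method="text", encoding="unicode")

print(repr(via_xpath))
print(repr(via_tostring))
print(repr(with_tail))
```

Both calls include the text of the embedded anchor, so they suit the case where all descendant text is wanted.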

Solution 2

As a public service to people out there who may be as lazy as I am, here is the code from the other answers in a form you can run.

from lxml import etree

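def get_text1(node):
    # Own text only: node.text plus the tail of each direct child.
    result = node.text or ""
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result

def get_text2(node):
    # All text, recursively. Note that this also appends the tail
    # of the node you start from, if it has one.
    return ((node.text or '') +
            ''.join(map(get_text2, node)) +
            (node.tail or ''))

def get_text3(node):
    # Own text plus the children serialized with their markup.
    # encoding="unicode" makes tostring() return str rather than bytes.
    return (node.text or "") + "".join(
        [etree.tostring(child, encoding="unicode") for child in node.iterchildren()])


root = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

print(root.xpath("text()"))
print(get_text1(root))
print(get_text2(root))
print(root.xpath("string()"))
print(etree.tostring(root, method="text", encoding="unicode"))
print(etree.tostring(root, method="xml", encoding="unicode"))
print(get_text3(root))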

Output is:

snowy:rpg$ python test.py 
[' text1 ', ' text2 ']
 text1  text2 
 text1  link  text2 
 text1  link  text2 
 text1  link  text2 
<td> text1 <a> link </a> text2 </td>
 text1 <a> link </a> text2 

Solution 3

This looks like an lxml bug to me, but according to the documentation it is by design. I've solved it like this:

def node_text(node):
    if node.text:
        result = node.text
    else:
        result = ''
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result
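The same result can probably be had without a helper by joining the list that xpath("text()") returns, which the asker mentions in the comments. A minimal sketch, with variable names chosen for the example:

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# text() selects only the text nodes that are direct children of td,
# i.e. td.text and the tails of its children, in document order.
pieces = td.xpath("text()")
own_text = "".join(pieces)

print(pieces)
print(repr(own_text))
```

The text of the anchor is left out because it is a child of <a>, not of <td>.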

Solution 4

Another approach that seems to work well for getting the text out of an element is "".join(element.itertext()). Note that this includes the text of child elements.
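A short sketch of the itertext() approach on the markup from the question. The whitespace cleanup at the end is an optional extra, not part of the original suggestion.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# itertext() yields every text and tail inside the subtree, in document order.
full = "".join(td.itertext())

# Optional: collapse runs of whitespace and trim the ends.
tidy = " ".join(full.split())

print(repr(full))
print(repr(tidy))
```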

Solution 5

<td> text1 <a> link </a> text2 </td>

Here's how lxml represents it (ignoring whitespace):

td.text == 'text1'
a.text == 'link'
a.tail == 'text2'

If you don't want the text that is inside child elements, collect only their tails (.text and .tail are None when empty, hence the "or ''"):

text = (td.text or '') + ''.join(el.tail or '' for el in td)
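In lxml, .text and .tail are None rather than an empty string when there is no text at that position, so concatenating them needs a guard. A small sketch with made-up markup that hits both cases:

```python
from lxml import etree

# Hypothetical sample: no text before <a>, and no text after <b>.
td = etree.fromstring("<td><a>link</a> text2 <b>bold</b></td>")
a, b = td

print(td.text)        # nothing precedes <a>
print(repr(a.tail))   # text between </a> and <b>
print(b.tail)         # nothing follows </b>

# Substitute '' for None before concatenating.
own_text = (td.text or "") + "".join(el.tail or "" for el in td)
print(repr(own_text))
```

Without the guards, the concatenation would raise a TypeError as soon as td.text or one of the tails is None.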
Author: user522034

Updated on May 17, 2022

Comments

  • user522034
    user522034 about 2 years

    I scraped some HTML via XPath, which I then converted into an etree. Something similar to this:

    <td> text1 <a> link </a> text2 </td>

    But when I call element.text, I only get text1. The rest must be there: when I check my query in Firebug, the text of the element is highlighted, both before and after the embedded anchor element.

  • user522034
    user522034 over 13 years
    tostring(element, method="text") almost works, but it also returns the text of the embedded anchor element, which I don't want.
  • user522034
    user522034 over 13 years
    element.text + child.tail works, but I wish element.text worked the way I want it to :)
  • user522034
    user522034 over 13 years
    element.xpath("string()") returns the same result as *.tostring(). I tried xpath("text()"), which doesn't return the text of the anchor element, but it returns a list of 2 strings. Thanks for pointing out some stuff though.
  • mmj
    mmj almost 8 years
    It's not a bug, actually it's the feature that allows you to interpose text among subelements when building an XML element: stackoverflow.com/q/38520331/694360
  • Jaap Versteegh
    Jaap Versteegh almost 8 years
    Thanks for pointing that out. I guess that is useful, but imho it would be a lot clearer if .text would just return the full text and some other suitably named property would contain only the part up to the first subelement. How about node.head. This also gives a clue that what you'll want next is child.tail without having to stackoverflow first.
  • Robert Williams
    Robert Williams almost 7 years
    Only pasting code is not enough. You should also explain why it works :)