Python element tree - extract text from element, stripping tags
Solution 1
If you are running under Python 3.2+, you can use itertext(). It creates a text iterator which loops over this element and all subelements, in document order, and returns all inner text:
import xml.etree.ElementTree as ET
xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
# -> 'Some example text'
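To make "document order" concrete, here is a small sketch with nested elements. The sample markup is made up for illustration. Each element's text comes first, then its children, and each child's tail follows that child:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup with two levels of nesting
root = ET.fromstring('<p>one <b>two <i>three</i> four</b> five</p>')

# itertext() yields each text fragment as it appears in the document
fragments = list(root.itertext())
print(fragments)
# -> ['one ', 'two ', 'three', ' four', ' five']

print(''.join(fragments))
# -> one two three four five
```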
If you are running in a lower version of Python, you can reuse the implementation of itertext() by attaching it to the Element class, after which you can call it exactly like above:
# original implementation of .itertext() for Python 2.7
def itertext(self):
    tag = self.tag
    if not isinstance(tag, basestring) and tag is not None:
        return
    if self.text:
        yield self.text
    for e in self:
        for s in e.itertext():
            yield s
        if e.tail:
            yield e.tail

# if necessary, monkey-patch the Element class
if 'itertext' not in ET.Element.__dict__:
    ET.Element.itertext = itertext
xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
# -> 'Some example text'
Solution 2
As the documentation says, if you want to read only the text, without any intermediate tags, you have to recursively concatenate all text and tail attributes in the correct order.
However, recent-enough versions (including the ones in the stdlib in 2.7 and 3.2, but not 2.6 or 3.1, and the current released versions of both ElementTree and lxml on PyPI) can do this for you automatically with the tostring function, by passing method='text':
>>> from xml.etree import ElementTree
>>> s = '''<tag>
...  Some <a>example</a> text
... </tag>'''
>>> t = ElementTree.fromstring(s)
>>> ElementTree.tostring(t, method='text')
'\n Some example text\n'
If you also want to strip whitespace from the text, you'll need to do so manually. In your simple case, that's easy:
>>> ElementTree.tostring(t, method='text').strip()
'Some example text'
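One thing to watch for on Python 3: tostring returns bytes by default, and passing encoding='unicode' gives you a str instead. A minimal sketch, assuming the standard library xml.etree.ElementTree and the same kind of sample markup as above:

```python
import xml.etree.ElementTree as ET

s = '<tag>\n  Some <a>example</a> text\n</tag>'
t = ET.fromstring(s)

# With the default encoding, Python 3 returns a bytes object
raw = ET.tostring(t, method='text')
print(type(raw).__name__)
# -> bytes

# encoding='unicode' returns a str, which can be stripped directly
text = ET.tostring(t, method='text', encoding='unicode')
print(text.strip())
# -> Some example text
```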
In more complicated cases, however, where you want to strip out whitespace within intermediate tags, you'll probably have to fall back on recursively processing the text and tail attributes. That's not too hard; you just have to remember to deal with the possibility that the attributes may be None. For example, here's a skeleton you can hook your own code on:
def textify(t):
    s = []
    if t.text:
        s.append(t.text)
    # iterate over the element directly; getchildren() was removed in Python 3.9
    for child in t:
        s.append(textify(child))
    if t.tail:
        s.append(t.tail)
    return ''.join(s)
This version only works when text and tail are guaranteed to be a str or None. For trees you build up manually, that's not guaranteed to be true.
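If all you need is the whitespace collapsed, a shorter route on Python 3 is to build on itertext(). The sketch below is only one way to do it: normalized_text is a hypothetical helper name, and coercing each fragment with str() is an assumption about how you would want non-string values in a hand-built tree treated.

```python
import xml.etree.ElementTree as ET

def normalized_text(elem):
    """Hypothetical helper: join all inner text of elem and collapse whitespace."""
    # itertext() already skips empty/None text and tail values;
    # str() is an assumed way to coerce any non-string fragments
    parts = (str(part) for part in elem.itertext())
    # split() with no arguments drops leading/trailing whitespace
    # and treats any run of whitespace as a single separator
    return ' '.join(''.join(parts).split())

root = ET.fromstring('<tag>\n  Some   <a> example </a>\n  text\n</tag>')
print(normalized_text(root))
# -> Some example text
```

Fragments are joined before splitting, so text that touches a tag without a space (as in foo<b>bar</b>) stays joined as one word.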
Brandon
Updated on June 04, 2022
Comments
-
Brandon about 2 years
With ElementTree in Python, how can I extract all the text from a node, stripping any tags in that element and keeping only the text?
For example, say I have the following:
<tag> Some <a>example</a> text </tag>
I want to return Some example text. How do I go about doing this? So far, the approaches I've taken have had fairly disastrous outcomes.
-
CodeMonkey about 8 years
Thank you, was searching for this for a while!
-
Tomalak over 5 years
N.B.: This applies only to lxml. The xml.etree package does not know enough XPath to do this.