Get all text from an XML document?

python xml lxml

Solution 1

EDIT: This is an answer posted when I thought one-space indentation is normal, and as the comments mention it's not a good answer. Check out the others for some better solutions. This is left here solely for archival reasons, do not follow it!

You asked for lxml:

# element.text is None for elements that contain no text, so skip those
reslist = list(root.iter())
result = ' '.join([element.text for element in reslist if element.text])

Or:

parts = []
for element in root.iter():
    if element.text:  # None when the element has no text
        parts.append(element.text)
result = ' '.join(parts)  # join avoids a trailing space
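
One limitation of reading only element.text: it holds the text before an element's first child, so anything that follows a child's closing tag (its tail) is skipped. Below is a minimal sketch of the difference, using itertext(), which both lxml and the standard library's ElementTree provide. The sample document and variable names are made up for illustration, and the standard library is used here only so the snippet runs without extra installs.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: " world" is the tail of <b>, not the text of any element.
sample = "<root>Hello <b>big</b> world<empty/></root>"
root = ET.fromstring(sample)

# .text only: misses the tail text; <empty/> has text None, so it is filtered out.
text_only = ' '.join(el.text for el in root.iter() if el.text)

# itertext() yields text and tail strings in document order.
all_text = ''.join(root.itertext())

print(repr(text_only))
print(repr(all_text))
```

With lxml the same two expressions should work unchanged on a root obtained from lxml.etree.fromstring.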

Solution 2

Using stdlib xml.etree

import xml.etree.ElementTree as ET

tree = ET.parse('sample.xml')
# encoding='unicode' returns a str; encoding='utf-8' would return bytes on Python 3
print(ET.tostring(tree.getroot(), encoding='unicode', method='text'))
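
Note that method='text' concatenates the text nodes with nothing in between, while the question asks for a single space where each tag was. A rough sketch of one way to get that with the same module, assuming it is acceptable to collapse whitespace inside the text as well; the input string and variable names are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical input; in practice this would come from a file or a response body.
xml_string = "<doc><title>Intro</title><p>First<br/>second</p></doc>"
root = ET.fromstring(xml_string)

# method='text' writes the text nodes back to back, so adjacent elements run together.
raw = ET.tostring(root, encoding='unicode', method='text')

# Join the pieces with a space, then collapse runs of whitespace,
# leaving one space wherever a tag used to be.
spaced = ' '.join(' '.join(root.itertext()).split())

print(raw)
print(spaced)
```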

Solution 3

I really like BeautifulSoup, and would rather not use regex on HTML if we can avoid it.

Adapted from: [this StackOverflow Answer], [BeautifulSoup documentation]

from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'xml')    # txt is a string holding your XML; the 'xml' parser requires lxml
pageText = soup.find_all(string=True)
print(' '.join(pageText))

Though of course, you can (and should) use BeautifulSoup to navigate the page for what you are looking for.

Solution 4

A solution that doesn't require an external library like BeautifulSoup, using the SAX parser from the standard library:

from xml import sax

class MyHandler(sax.handler.ContentHandler):
    def parse(self, filename):
        self.text = []
        sax.parse(filename, self)
        return ''.join(self.text)

    def characters(self, data):
        self.text.append(data)

result = MyHandler().parse("yourfile.xml")

If you need all whitespace intact in the text, also define the ignorableWhitespace method in the handler class in the same way characters is defined.
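
If the goal is a single space in place of every tag, as the question asks, the same handler idea can be extended. This is only a sketch under that assumption: the class name SpacedTextHandler, its text() helper, and the sample document are invented for illustration, and whitespace inside the text is collapsed too.

```python
from xml import sax

class SpacedTextHandler(sax.handler.ContentHandler):
    """Collects character data and adds a space at every tag boundary."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def startElement(self, name, attrs):
        self.parts.append(' ')

    def endElement(self, name):
        self.parts.append(' ')

    def characters(self, data):
        self.parts.append(data)

    def text(self):
        # split() with no arguments collapses whitespace runs and trims both ends
        return ' '.join(''.join(self.parts).split())

handler = SpacedTextHandler()
# parseString accepts bytes; the document here is a made-up sample
sax.parseString(b"<doc><title>Intro</title><p>First<br/>second</p></doc>", handler)
spaced_text = handler.text()
print(spaced_text)
```

For a file on disk, sax.parse("yourfile.xml", handler) can be used in place of parseString.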

Author by

Richard

Updated on June 14, 2022

Comments

Richard almost 2 years

How can I get all the text content of an XML document as a single string, like this Ruby/hpricot example, but using Python?

I'd like to replace XML tags with a single whitespace.