Get all text from an XML document?

13,653

Solution 1

EDIT: I posted this answer back when I thought one-space indentation was normal, and as the comments mention, it is not a good answer. Check out the others for better solutions. This is left here solely for archival reasons; do not follow it!

You asked for lxml:

reslist = list(root.iter())
# element.text is None for elements without leading text, so skip those
result = ' '.join(element.text for element in reslist if element.text)

Or:

result = ''
for element in root.iter():
    if element.text:  # text is None when the element has no leading text
        result += element.text + ' '
result = result[:-1]  # Remove trailing space
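
One reason the loops above fall short is that element.text only covers the text before an element's first child; text that follows a closing tag is stored in the tail attribute of that child. Below is a minimal sketch of collecting both, written against the stdlib xml.etree.ElementTree so it runs without extra installs (lxml's etree exposes the same text and tail attributes). The sample document and the helper name all_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

def all_text(root):
    """Collect the text and tail of every element, in document order."""
    parts = []

    def walk(element):
        # Text that appears before the element's first child
        if element.text:
            parts.append(element.text)
        for child in element:
            walk(child)
            # Text that appears right after the child's closing tag
            if child.tail:
                parts.append(child.tail)

    walk(root)
    return parts

root = ET.fromstring("<doc>Hello <b>big</b> world<i/>!</doc>")
print(all_text(root))
```

Joining the returned list with '' reproduces the document text as written; joining with ' ' puts a space where each tag was.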

Solution 2

Using stdlib xml.etree

import xml.etree.ElementTree as ET

tree = ET.parse('sample.xml')
# encoding='unicode' returns a str; encoding='utf-8' would return bytes on Python 3
print(ET.tostring(tree.getroot(), encoding='unicode', method='text'))
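
If the goal is to put a single space where each tag was, as the question asks, one possible variation is to use itertext() and normalize whitespace yourself. This is only a sketch: the helper name text_with_single_spaces and the sample string are assumptions for illustration, and stripping each chunk also discards indentation between tags.

```python
import xml.etree.ElementTree as ET

def text_with_single_spaces(root):
    # itertext() yields every text and tail string in document order.
    chunks = (chunk.strip() for chunk in root.itertext())
    # Drop whitespace-only chunks (indentation between tags), then join.
    return " ".join(chunk for chunk in chunks if chunk)

sample = "<doc>\n  <title>Hello</title>\n  <body>big <b>wide</b> world</body>\n</doc>"
root = ET.fromstring(sample)
print(text_with_single_spaces(root))
```

Note that this also inserts a space where a tag splits a word (for example "foo<b>bar</b>" becomes "foo bar"), which matches the stated requirement but may not suit every document.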

Solution 3

I really like BeautifulSoup, and would rather not use regex on HTML if we can avoid it.

Adapted from: [this StackOverflow Answer], [BeautifulSoup documentation]

from bs4 import BeautifulSoup
# txt is simply a string with your XML file; the 'xml' parser requires lxml to be installed
soup = BeautifulSoup(txt, 'xml')
pageText = soup.find_all(string=True)
print(' '.join(pageText))

Though of course, you can (and should) use BeautifulSoup to navigate the page for what you are looking for.

Solution 4

A solution that doesn't require an external library like BeautifulSoup, using the built-in sax parsing framework:

from xml import sax

class MyHandler(sax.handler.ContentHandler):
    def parse(self, filename):
        self.text = []
        sax.parse(filename, self)
        return ''.join(self.text)

    def characters(self, data):
        self.text.append(data)

result = MyHandler().parse("yourfile.xml")

If you need all whitespace intact in the text, also define the ignorableWhitespace method in the handler class in the same way characters is defined.
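
As a rough sketch of that suggestion, the handler below defines both methods and parses from an in-memory byte string instead of a file. The class name TextCollector and the sample input are assumptions for illustration. Whether ignorableWhitespace is ever called depends on the underlying parser, so treat it as a safety net and not a guarantee.

```python
from xml import sax
from xml.sax.handler import ContentHandler

class TextCollector(ContentHandler):
    def __init__(self):
        super().__init__()
        self.text = []

    def characters(self, data):
        # Called for character data; may arrive in several chunks
        self.text.append(data)

    def ignorableWhitespace(self, whitespace):
        # Some parsers report ignorable whitespace here, not via characters()
        self.text.append(whitespace)

handler = TextCollector()
sax.parseString(b"<doc>Hello <b>big</b> world</doc>", handler)
result = "".join(handler.text)
print(result)
```

Because the chunks are joined with an empty string, the result does not depend on how the parser splits the character data.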

Richard
Author by

Richard

Updated on June 14, 2022

Comments

  • Richard
    Richard almost 2 years

    How can I get all the text content of an XML document as a single string, like this Ruby/hpricot example but using Python?

    I'd like to replace XML tags with a single whitespace.