Python: Unicode and ElementTree.parse


Solution 1

Can't you use

doc = ET.fromstring(source)

in your first example?
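A minimal sketch of that suggestion, written for Python 3 with xml.etree.ElementTree. The `source` literal is a shortened stand-in for the document in the question, and the wrapping step is an assumption about what the caller needs: fromstring returns an Element, so the sketch wraps it in ET.ElementTree for code that expects the kind of object parse returns.

```python
import xml.etree.ElementTree as ET

# Bytes literal standing in for the question's document (illustrative data).
source = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<root>
  <Parent>
    <Child>
      <Element>Text</Element>
    </Child>
  </Parent>
</root>
"""

# fromstring parses the whole document and returns the root Element.
root = ET.fromstring(source)

# Wrap the root when an ElementTree object is wanted instead of an Element.
tree = ET.ElementTree(root)

element_text = tree.find("Parent/Child/Element").text
print(root.tag, element_text)
```

If only the root element is needed, the wrapping step can be skipped.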

Solution 2

Your problem is that you are feeding ElementTree unicode, but it prefers to consume bytes. It will provide you with unicode in any case.

In Python 2.x, it can only consume bytes. You can tell it what encoding those bytes are in, but that's it. So, if you literally have to work with an object that represents a text file, like io.StringIO, first you will need to convert it into something else.

If you are literally starting with a 2.x-str (AKA bytes) in UTF-8 encoding, in memory, as in your example, use xml.etree.cElementTree.XML to parse it into XML in one fell swoop and don't worry about any of this :-).

If you want an interface that can deal with data read incrementally from a file, use xml.etree.cElementTree.parse with an io.BytesIO, which gives it an in-memory stream of bytes rather than an in-memory stream of characters. If you want to use io.open, open the file with the b flag so that you get a stream of bytes.
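As a rough illustration of the BytesIO route, here is a sketch written for Python 3, where xml.etree.ElementTree takes the place of cElementTree. The `text` value is made-up sample data; the point is only that text already held in memory is encoded to match its XML declaration before being handed to parse.

```python
import io
import xml.etree.ElementTree as ET

# Sample text (illustrative); \u00e9 is a non-ASCII character.
text = u'<?xml version="1.0" encoding="UTF-8"?>\n<root><Element>caf\u00e9</Element></root>\n'

# Encode the text to bytes in the declared encoding, then wrap the bytes
# in an in-memory binary stream that parse can read from.
byte_stream = io.BytesIO(text.encode("utf-8"))

doc = ET.parse(byte_stream)
parsed_text = doc.getroot().find("Element").text
```

The same stream-of-bytes shape is what a file opened with the b flag provides, so the two can be swapped without changing the parsing call.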

In Python 3.x, you can pass unicode directly in to ElementTree, which is a bit more convenient, and arguably the newer version of ElementTree is more correct to allow this. However, you still might not want to, and Python 3's version does still accept bytes as input. You're always starting with bytes anyway: by passing them directly from your input source to ElementTree, you get to let it do its encoding or decoding intelligently inside the XML parsing engine, as well as do on-the-fly detection of encoding declarations within the input stream, which you can do with XML but you can't do with arbitrary textual data. So letting the XML parser do the work of decoding is the right place to put that responsibility.
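To make the point about encoding declarations concrete, here is a small Python 3 sketch. The document and its ISO-8859-1 declaration are invented for illustration. The parser reads the declaration from the bytes and decodes accordingly, whereas decoding the same bytes by hand with a wrong guess (UTF-8 here) fails.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative document that declares ISO-8859-1, encoded accordingly.
raw = u'<?xml version="1.0" encoding="ISO-8859-1"?>\n<root>caf\u00e9</root>'.encode("iso-8859-1")

# Given bytes, the parser honours the encoding declaration itself.
root = ET.parse(io.BytesIO(raw)).getroot()
declared_text = root.text

# Decoding by hand requires knowing the encoding up front; a wrong guess
# such as UTF-8 raises an error on these bytes.
try:
    raw.decode("utf-8")
    guess_failed = False
except UnicodeDecodeError:
    guess_failed = True
```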

Solution 3

I encountered the same problem as you in Python 2.6.

It seems that cElementTree.parse handles "utf-8" encoded input differently in Python 2.x and Python 3.x. In Python 2.x, we can pass an XMLParser with an explicit encoding to the parse call. For example:

# Python 2.x
import xml.etree.cElementTree as etree

# Tell the parser which encoding the bytes in the file use.
parser = etree.XMLParser(encoding="utf-8")
targetTree = etree.parse("./targetPageID.xml", parser=parser)
pageIds = targetTree.find("categorymembers")
print "pageIds:", etree.tostring(pageIds)

You can refer to this page for the XMLParser method (Section "XMLParser"): http://effbot.org/zone/elementtree-13-intro.htm

The following approach works in Python 3.x:

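import codecs
import xml.etree.ElementTree as etree  # cElementTree is deprecated since Python 3.3

# Open the file as decoded text; the with block closes it after parsing.
with codecs.open("./targetPageID.xml", mode="r", encoding="utf-8") as target_file:
    targetTree = etree.parse(target_file)

pageIds = targetTree.find("categorymembers")
print("pageIds:", etree.tostring(pageIds, encoding="unicode"))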

Hope this can help you.

Author by

Santa

Updated on September 10, 2020

Comments

  • Santa
    Santa over 3 years

    I'm trying to move to Python 2.7, and since Unicode is a Big Deal there, I tried handling XML files and text as unicode and parsing them with the xml.etree.cElementTree library. But I ran across this error:

    >>> import xml.etree.cElementTree as ET
    >>> from io import StringIO
    >>> source = """\
    ... <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
    ... <root>
    ...   <Parent>
    ...     <Child>
    ...       <Element>Text</Element>
    ...     </Child>
    ...   </Parent>
    ... </root>
    ... """
    >>> srcbuf = StringIO(source.decode('utf-8'))
    >>> doc = ET.parse(srcbuf)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 56, in parse
      File "<string>", line 35, in parse
    cElementTree.ParseError: no element found: line 1, column 0
    

    The same thing happens using io.open('filename.xml', encoding='utf-8') to pass to ET.parse:

    >>> with io.open('test.xml', mode='w', encoding='utf-8') as fp:
    ...     fp.write(source.decode('utf-8'))
    ...
    150L
    >>> with io.open('test.xml', mode='r', encoding='utf-8') as fp:
    ...     fp.read()
    ...
    u'<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>\n<root>\n  <Parent>\n    <Child>\n      <Element>Text</Element>\n    </Child>\n  </Parent>\n</root>\n'
    >>> with io.open('test.xml', mode='r', encoding='utf-8') as fp:
    ...     ET.parse(fp)
    ...
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "<string>", line 56, in parse
      File "<string>", line 35, in parse
    cElementTree.ParseError: no element found: line 1, column 0
    

    Is there something about unicode and ET parsing that I am missing here?

    edit: Apparently, the ET parser does not play well with unicode input stream? The following works:

    >>> with io.open('test.xml', mode='rb') as fp:
    ...     ET.parse(fp)
    ...
    <ElementTree object at 0x0180BC10>
    

    But this also means I cannot use io.StringIO if I want to parse from an in-memory text, unless I encode it first into an in-memory buffer?

  • Santa
    Santa over 13 years
    I didn't realize that function existed. Minor point though: fromstring returns an Element, while parse returns an ElementTree.
  • Glyph
    Glyph over 11 years
    This function is also known as its alias, XML, i.e. from xml.etree.cElementTree import XML. This alias is there so that it will read nicely if you have an XML constant in your code; you can just do fooDocument = XML(""" ... """).
  • Glyph
    Glyph over 11 years
    This is sort of right, but the ElementTree versions are more similar between Python 2 and Python 3 than you give them credit for. I'll write up a different answer, I think.
  • Saransh Singh
    Saransh Singh almost 3 years
    The xml.etree.cElementTree module has been deprecated since Python 3.3. See docs.python.org/3/library/xml.etree.elementtree.html