Python: Unicode and ElementTree.parse
Solution 1
Can't you use
doc = ET.fromstring(source)
in your first example?
Solution 2
Your problem is that you are feeding ElementTree unicode, but it prefers to consume bytes. It will provide you with unicode in any case.
In Python 2.x, it can only consume bytes. You can tell it what encoding those bytes are in, but that's it. So, if you literally have to work with an object that represents a text file, like io.StringIO, first you will need to convert it into something else.
If you are literally starting with a 2.x str (AKA bytes) in UTF-8 encoding, in memory, as in your example, use xml.etree.cElementTree.XML to parse it into XML in one fell swoop and don't worry about any of this :-).
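As a minimal sketch of that in-memory case: the snippet below assumes xml.etree.ElementTree rather than cElementTree (the C-accelerated alias was removed in Python 3.9), and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# UTF-8 encoded bytes, which is what a Python 2 str literal would hold.
source = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<root>
  <Parent>
    <Child>
      <Element>Text</Element>
    </Child>
  </Parent>
</root>
"""

# XML() parses the bytes in one call and returns the root Element.
root = ET.XML(source)
element_text = root.find("Parent/Child/Element").text
print(root.tag, element_text)
```

No StringIO or explicit decode step is involved; the parser reads the encoding from the XML declaration.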
If you want an interface that can deal with data that is incrementally read from a file, use xml.etree.cElementTree.parse with an io.BytesIO to convert it into an in-memory stream of bytes rather than an in-memory string of characters. If you want to use io.open, use it with the b flag, so that you get streams of bytes.
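A rough sketch of the stream-based approach, assuming you start from a unicode string and again using xml.etree.ElementTree in place of cElementTree; the sample document and variable names are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# A unicode (text) string containing a non-ASCII character.
text = u'<?xml version="1.0" encoding="UTF-8"?><root><Element>caf\u00e9</Element></root>'

# Encode the characters to bytes first, then wrap the bytes in a stream.
srcbuf = io.BytesIO(text.encode("utf-8"))

# parse() reads from the byte stream and returns an ElementTree.
doc = ET.parse(srcbuf)
parsed_text = doc.getroot().find("Element").text
```

The same idea applies to files on disk: opening with io.open('test.xml', mode='rb') hands the parser bytes instead of decoded text.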
In Python 3.x, you can pass unicode directly in to ElementTree, which is a bit more convenient, and arguably the newer version of ElementTree is more correct to allow this. However, you still might not want to, and Python 3's version does still accept bytes as input. You're always starting with bytes anyway: by passing them directly from your input source to ElementTree, you get to let it do its encoding or decoding intelligently inside the XML parsing engine, as well as do on-the-fly detection of encoding declarations within the input stream, which you can do with XML but you can't do with arbitrary textual data. So letting the XML parser do the work of decoding is the right place to put that responsibility.
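To illustrate the point about encoding declarations, here is a small sketch in which the bytes are not UTF-8 at all. The sample document is hypothetical, and xml.etree.ElementTree is assumed in place of cElementTree.

```python
import io
import xml.etree.ElementTree as ET

# The declaration says ISO-8859-1, and the bytes really are ISO-8859-1.
text = u'<?xml version="1.0" encoding="ISO-8859-1"?><root>caf\u00e9</root>'
latin1_bytes = text.encode("iso-8859-1")

# The parser reads the declaration from the byte stream and decodes
# accordingly; no encoding is passed in by the caller.
root = ET.parse(io.BytesIO(latin1_bytes)).getroot()
decoded = root.text
```

Had the bytes been decoded by hand with the wrong codec before parsing, the text would have been mangled or the decode would have failed, which is the argument for leaving decoding to the XML parser.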
Solution 3
I encountered the same problem as you in Python 2.6.
It seems that cElementTree.parse handles UTF-8 input differently in Python 2.x and 3.x. In Python 2.x, we can pass an XMLParser with an explicit encoding. For example:
import xml.etree.cElementTree as etree
parser = etree.XMLParser(encoding="utf-8")
targetTree = etree.parse( "./targetPageID.xml", parser=parser )
pageIds = targetTree.find("categorymembers")
print "pageIds:",etree.tostring(pageIds)
You can refer to this page for the XMLParser method (Section "XMLParser"): http://effbot.org/zone/elementtree-13-intro.htm
The following method works in Python 3.x:
import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9
import codecs
# Open the file as decoded text; Python 3's parser accepts text input.
target_file = codecs.open("./targetPageID.xml", mode='r', encoding='utf-8')
targetTree = etree.parse(target_file)
pageIds = targetTree.find("categorymembers")
# print is a function in Python 3
print("pageIds:", etree.tostring(pageIds))
Hope this can help you.
Santa
Updated on September 10, 2020
Comments
Santa over 3 years
I'm trying to move to Python 2.7, and since Unicode is a Big Deal there, I tried handling it in XML files and text by parsing them with the xml.etree.cElementTree library. But I ran across this error:
>>> import xml.etree.cElementTree as ET
>>> from io import StringIO
>>> source = """\
... <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
... <root>
... <Parent>
... <Child>
... <Element>Text</Element>
... </Child>
... </Parent>
... </root>
... """
>>> srcbuf = StringIO(source.decode('utf-8'))
>>> doc = ET.parse(srcbuf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 56, in parse
  File "<string>", line 35, in parse
cElementTree.ParseError: no element found: line 1, column 0
The same thing happens using io.open('filename.xml', encoding='utf-8') to pass to ET.parse:
>>> with io.open('test.xml', mode='w', encoding='utf-8') as fp:
...     fp.write(source.decode('utf-8'))
...
150L
>>> with io.open('test.xml', mode='r', encoding='utf-8') as fp:
...     fp.read()
...
u'<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>\n<root>\n <Parent>\n <Child>\n <Element>Text</Element>\n </Child>\n </Parent>\n</root>\n '
>>> with io.open('test.xml', mode='r', encoding='utf-8') as fp:
...     ET.parse(fp)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<string>", line 56, in parse
  File "<string>", line 35, in parse
cElementTree.ParseError: no element found: line 1, column 0
Is there something about unicode and ET parsing that I am missing here?
Edit: Apparently, the ET parser does not play well with a unicode input stream? The following works:
>>> with io.open('test.xml', mode='rb') as fp:
...     ET.parse(fp)
...
<ElementTree object at 0x0180BC10>
But this also means I cannot use io.StringIO if I want to parse from in-memory text, unless I encode it first into an in-memory buffer?
Santa over 13 years
I didn't realize that function existed. Minor point though: fromstring returns an Element, while parse returns an ElementTree.
Glyph over 11 years
This function is also known as its alias, XML, i.e. from xml.etree.cElementTree import XML. This alias is there so that it will read nicely if you have an XML constant in your code; you can just do fooDocument = XML(""" ... """).
Glyph over 11 years
This is sort of right, but the ElementTree versions are more similar between Python 2 and Python 3 than you give them credit for. I'll write up a different answer, I think.
Saransh Singh almost 3 years
The xml.etree.cElementTree module has been deprecated since Python 3.3. See docs.python.org/3/library/xml.etree.elementtree.html