Pandas read_xml() method test strategies
PERFORMANCE: How do you explain the slower iterparse, which is often recommended for larger files because the file is parsed iteratively? Is it partly due to the if logic checks?
I would assume that more Python code would make it slower, as the Python code is evaluated on every parse event. Have you tried a JIT compiler like PyPy?
If I remove i and use first_tag only, it seems to be quite a bit faster, so yes, it is partly due to the if logic checks:
def read_xml_iterparse2(path):
    data = []
    inner = {}
    first_tag = None
    for (ev, el) in et.iterparse(path):
        # remember the tag of the first element that finishes parsing
        if not first_tag:
            first_tag = el.tag
        # seeing that tag again starts a new row
        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)
            inner = {}
        if el.text is not None and len(el.text.strip()) > 0:
            inner[el.tag] = el.text
    return pd.DataFrame(data)
%timeit read_xml_iterparse(path)
# 10 loops, best of 5: 33 ms per loop
%timeit read_xml_iterparse2(path)
# 10 loops, best of 5: 23 ms per loop
I wasn't sure I understood the purpose of the last if check, but I'm also not sure why you would want to lose whitespace-only elements. Removing the last if consistently shaves off a little bit of time:
def read_xml_iterparse3(path):
    data = []
    inner = {}
    first_tag = None
    for (ev, el) in et.iterparse(path):
        if not first_tag:
            first_tag = el.tag
        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)
            inner = {}
        # no whitespace check: every element's text is stored
        inner[el.tag] = el.text
    return pd.DataFrame(data)
%timeit read_xml_iterparse(path)
# 10 loops, best of 5: 34.4 ms per loop
%timeit read_xml_iterparse2(path)
# 10 loops, best of 5: 24.5 ms per loop
%timeit read_xml_iterparse3(path)
# 10 loops, best of 5: 20.9 ms per loop
Now, with or without those performance improvements, your iterparse version seems to produce a dataframe with far too many rows (7050 instead of 900 on the larger file). Here is what seems to be a working, fast version:
def read_xml_iterparse5(path):
    data = []
    inner = {}
    for (ev, el) in et.iterparse(path):
        # An ending parent element triggers a new row; in our case its .text
        # is '\n' followed by spaces. It would be more reliable to pass
        # 'topusers' to read_xml_iterparse5 as the .tag to check.
        if el.text and el.text[0] == '\n':
            # ignore /stackoverflow
            if inner:
                data.append(inner)
                inner = {}
        else:
            inner[el.tag] = el.text
    return pd.DataFrame(data)
print(read_xml_iterfind(path).shape)
# (900, 8)
print(read_xml_iterparse(path).shape)
# (7050, 8)
print(read_xml_lxml_xpath(path).shape)
# (900, 8)
print(read_xml_lxml_xsl(path).shape)
# (900, 8)
print(read_xml_iterparse5(path).shape)
# (900, 8)
%timeit read_xml_iterparse5(path)
# 10 loops, best of 5: 20.6 ms per loop
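As the comment in read_xml_iterparse5 notes, checking the row tag directly would be more reliable than checking for a leading newline. A minimal sketch of that idea follows, assuming the row element name is known in advance. The function name iterparse_rows, its row_tag parameter, and the two-record sample are illustrative and not part of the original benchmark; the function returns plain dicts so that it needs only the standard library.

```python
import io
import xml.etree.ElementTree as et


def iterparse_rows(source, row_tag):
    """Collect one dict per <row_tag> element, keyed by child tag.

    Hypothetical helper: row_tag is assumed to be known by the caller.
    """
    rows = []
    # Only "end" events are requested, so an element is complete when seen.
    for _event, el in et.iterparse(source, events=("end",)):
        if el.tag == row_tag:
            # All children have been parsed by the time the parent ends.
            rows.append({child.tag: child.text for child in el})
    return rows


# Small illustrative sample (not the benchmark file).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<stackoverflow>
  <topusers>
    <user>jezrael</user>
    <location>Bratislava, Slovakia</location>
  </topusers>
  <topusers>
    <user>akrun</user>
    <location></location>
  </topusers>
</stackoverflow>"""

rows = iterparse_rows(io.BytesIO(SAMPLE), "topusers")
print(rows)
```

The resulting list can be passed to pd.DataFrame() in the same way as data in the versions above. This variant does not depend on how the file is indented, but it was not timed against the others, so no speed claim is made for it.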
MEMORY: Does CPU memory correlate with timings in I/O calls? XSLT and XPath 1.0 tend not to scale well with larger XML documents, as the entire file must be read into memory to be parsed.
I'm not totally sure what you mean by "I/O calls" but if your document is small enough to fit in cache, then everything will be much faster as it won't evict many other items from the cache.
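On the memory side of the question, one commonly used pattern with iterparse is to clear each row element once it has been consumed, so that the text and children of finished rows do not stay attached to the tree. The sketch below is an illustration under stated assumptions, not a measurement of the methods above: stream_rows, peak_bytes, and the synthetic 2,000-row document are hypothetical, and tracemalloc only reports allocations made through Python's own allocator.

```python
import io
import tracemalloc
import xml.etree.ElementTree as et


def stream_rows(source, row_tag):
    """Yield one dict per row, then clear the row element (hypothetical helper)."""
    for _event, el in et.iterparse(source, events=("end",)):
        if el.tag == row_tag:
            yield {child.tag: child.text for child in el}
            # Drop the finished row's children and text. The emptied row
            # element itself stays attached to the root.
            el.clear()


def peak_bytes(func):
    """Run func() and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    result = func()
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


# Synthetic document, assumed only for illustration.
xml_bytes = (
    "<root>"
    + "".join(f"<row><id>{i}</id><name>user{i}</name></row>" for i in range(2000))
    + "</root>"
).encode("utf-8")

# Whole tree in memory: count the children of the root.
full_count, full_peak = peak_bytes(
    lambda: len(et.parse(io.BytesIO(xml_bytes)).getroot())
)
# Streaming: consume the rows one at a time without keeping them.
stream_count, stream_peak = peak_bytes(
    lambda: sum(1 for _ in stream_rows(io.BytesIO(xml_bytes), "row"))
)

print("full parse:", full_count, "rows, peak bytes:", full_peak)
print("streaming: ", stream_count, "rows, peak bytes:", stream_peak)
```

The streaming count discards each row after counting it; a version that appends every row to a list for DataFrame() would hold all the data anyway, so any saving would apply only to the parsed tree, not to the collected rows.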
STRATEGY: Is a list of dictionaries an optimal strategy for the DataFrame() call? See these interesting answers: a generator version and an iterwalk user-defined version. Both upcast lists to a dataframe.
The lists use less memory, so depending on how many columns you have, it could make a noticeable difference. Of course, this then requires your XML tags to be in a consistent order, which they do appear to be. The DataFrame() call would also need to do less work, as it doesn't have to look up keys in the dict on every row to figure out which column each value belongs to.
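The list-based strategy can be sketched as follows, assuming every row carries the same child tags in the same order. The name rows_as_lists, the order check, and the sample data are illustrative assumptions, not code from the linked answers.

```python
import io
import xml.etree.ElementTree as et


def rows_as_lists(source, row_tag):
    """Return (columns, rows), where each row is a list of child texts.

    Hypothetical helper: assumes all rows share one tag order and raises
    ValueError if a row deviates from the first row's order.
    """
    columns = None
    rows = []
    for _event, el in et.iterparse(source, events=("end",)):
        if el.tag != row_tag:
            continue
        tags = [child.tag for child in el]
        if columns is None:
            columns = tags  # the first row defines the column order
        elif tags != columns:
            raise ValueError(f"unexpected tag order: {tags}")
        rows.append([child.text for child in el])
    return columns, rows


# Small illustrative sample (not the benchmark file).
SAMPLE = b"""<stackoverflow>
  <topusers>
    <user>jezrael</user>
    <location>Bratislava, Slovakia</location>
  </topusers>
  <topusers>
    <user>akrun</user>
    <location></location>
  </topusers>
</stackoverflow>"""

columns, rows = rows_as_lists(io.BytesIO(SAMPLE), "topusers")
print(columns)
print(rows)
```

The two return values can then be handed over as pd.DataFrame(rows, columns=columns). The order check costs a list comparison per row; it could be dropped when the input is trusted, at the price of silently misaligned columns if the assumption fails.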
Parfait
Updated on June 28, 2022
Comments
-
Parfait almost 2 years
Currently, pandas I/O tools do not maintain a read_xml() method or its counterpart to_xml(). However, read_json proves that tree-like structures can be implemented for dataframe import, and read_html does the same for markup formats.

If the pandas team does consider such a read_xml method for a future pandas version, what implementation would they pursue: parsing with the built-in xml.etree.ElementTree with its iterfind() or iterparse() functions, or the third-party module lxml with its XPath 1.0 and XSLT 1.0 methods?

Below are my test runs for four method types on a simple, flat, element-centric XML input. All are set up for generalized parsing of any second-level children of the root, and each method should yield the exact same pandas dataframe. All but the last call pd.DataFrame() on a list of dictionaries. The XSLT method transforms the XML to CSV, which is cast to StringIO() and passed to pd.read_csv().

Question (multi-part)
PERFORMANCE: How do you explain the slower iterparse, which is often recommended for larger files because the file is parsed iteratively? Is it partly due to the if logic checks?
MEMORY: Does CPU memory correlate with timings in I/O calls? XSLT and XPath 1.0 tend not to scale well with larger XML documents, as the entire file must be read into memory to be parsed.
STRATEGY: Is a list of dictionaries an optimal strategy for the DataFrame() call? See these interesting answers: a generator version and an iterwalk user-defined version. Both upcast lists to a dataframe.
Input Data (Stack Overflow's current top users by year of which our pandas friends are included)
<?xml version="1.0" encoding="utf-8"?>
<stackoverflow>
  <topusers>
    <user>Gordon Linoff</user>
    <link>http://www.stackoverflow.com//users/1144035/gordon-linoff</link>
    <location>New York, United States</location>
    <year_rep>5,985</year_rep>
    <total_rep>499,408</total_rep>
    <tag1>sql</tag1>
    <tag2>sql-server</tag2>
    <tag3>mysql</tag3>
  </topusers>
  <topusers>
    <user>Günter Zöchbauer</user>
    <link>http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer</link>
    <location>Linz, Austria</location>
    <year_rep>5,835</year_rep>
    <total_rep>154,439</total_rep>
    <tag1>angular2</tag1>
    <tag2>typescript</tag2>
    <tag3>javascript</tag3>
  </topusers>
  <topusers>
    <user>jezrael</user>
    <link>http://www.stackoverflow.com//users/2901002/jezrael</link>
    <location>Bratislava, Slovakia</location>
    <year_rep>5,740</year_rep>
    <total_rep>83,237</total_rep>
    <tag1>pandas</tag1>
    <tag2>python</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>VonC</user>
    <link>http://www.stackoverflow.com//users/6309/vonc</link>
    <location>France</location>
    <year_rep>5,577</year_rep>
    <total_rep>651,397</total_rep>
    <tag1>git</tag1>
    <tag2>github</tag2>
    <tag3>docker</tag3>
  </topusers>
  <topusers>
    <user>Martijn Pieters</user>
    <link>http://www.stackoverflow.com//users/100297/martijn-pieters</link>
    <location>Cambridge, United Kingdom</location>
    <year_rep>5,337</year_rep>
    <total_rep>525,176</total_rep>
    <tag1>python</tag1>
    <tag2>python-3.x</tag2>
    <tag3>python-2.7</tag3>
  </topusers>
  <topusers>
    <user>T.J. Crowder</user>
    <link>http://www.stackoverflow.com//users/157247/t-j-crowder</link>
    <location>United Kingdom</location>
    <year_rep>5,258</year_rep>
    <total_rep>508,310</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>java</tag3>
  </topusers>
  <topusers>
    <user>akrun</user>
    <link>http://www.stackoverflow.com//users/3732271/akrun</link>
    <location></location>
    <year_rep>5,188</year_rep>
    <total_rep>229,553</total_rep>
    <tag1>r</tag1>
    <tag2>dplyr</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>Wiktor Stribiżew</user>
    <link>http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew</link>
    <location>Warsaw, Poland</location>
    <year_rep>4,948</year_rep>
    <total_rep>158,134</total_rep>
    <tag1>regex</tag1>
    <tag2>javascript</tag2>
    <tag3>c#</tag3>
  </topusers>
  <topusers>
    <user>Darin Dimitrov</user>
    <link>http://www.stackoverflow.com//users/29407/darin-dimitrov</link>
    <location>Sofia, Bulgaria</location>
    <year_rep>4,936</year_rep>
    <total_rep>709,683</total_rep>
    <tag1>c#</tag1>
    <tag2>asp.net-mvc</tag2>
    <tag3>asp.net-mvc-3</tag3>
  </topusers>
  <topusers>
    <user>Eric Duminil</user>
    <link>http://www.stackoverflow.com//users/6419007/eric-duminil</link>
    <location></location>
    <year_rep>4,854</year_rep>
    <total_rep>12,557</total_rep>
    <tag1>ruby</tag1>
    <tag2>ruby-on-rails</tag2>
    <tag3>arrays</tag3>
  </topusers>
  <topusers>
    <user>alecxe</user>
    <link>http://www.stackoverflow.com//users/771848/alecxe</link>
    <location>New York, United States</location>
    <year_rep>4,723</year_rep>
    <total_rep>233,368</total_rep>
    <tag1>python</tag1>
    <tag2>selenium</tag2>
    <tag3>protractor</tag3>
  </topusers>
  <topusers>
    <user>Jean-François Fabre</user>
    <link>http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre</link>
    <location>Toulouse, France</location>
    <year_rep>4,526</year_rep>
    <total_rep>30,027</total_rep>
    <tag1>python</tag1>
    <tag2>python-3.x</tag2>
    <tag3>python-2.7</tag3>
  </topusers>
  <topusers>
    <user>piRSquared</user>
    <link>http://www.stackoverflow.com//users/2336654/pirsquared</link>
    <location>Bellevue, WA, United States</location>
    <year_rep>4,482</year_rep>
    <total_rep>41,183</total_rep>
    <tag1>pandas</tag1>
    <tag2>python</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>CommonsWare</user>
    <link>http://www.stackoverflow.com//users/115145/commonsware</link>
    <location>Who Wants to Know?</location>
    <year_rep>4,475</year_rep>
    <total_rep>616,135</total_rep>
    <tag1>android</tag1>
    <tag2>java</tag2>
    <tag3>android-intent</tag3>
  </topusers>
  <topusers>
    <user>Quentin</user>
    <link>http://www.stackoverflow.com//users/19068/quentin</link>
    <location>United Kingdom</location>
    <year_rep>4,464</year_rep>
    <total_rep>509,365</total_rep>
    <tag1>javascript</tag1>
    <tag2>html</tag2>
    <tag3>css</tag3>
  </topusers>
  <topusers>
    <user>Jon Skeet</user>
    <link>http://www.stackoverflow.com//users/22656/jon-skeet</link>
    <location>Reading, United Kingdom</location>
    <year_rep>4,348</year_rep>
    <total_rep>921,690</total_rep>
    <tag1>c#</tag1>
    <tag2>java</tag2>
    <tag3>.net</tag3>
  </topusers>
  <topusers>
    <user>Felix Kling</user>
    <link>http://www.stackoverflow.com//users/218196/felix-kling</link>
    <location>Sunnyvale, CA</location>
    <year_rep>4,324</year_rep>
    <total_rep>411,535</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>asynchronous</tag3>
  </topusers>
  <topusers>
    <user>matt</user>
    <link>http://www.stackoverflow.com//users/341994/matt</link>
    <location></location>
    <year_rep>4,313</year_rep>
    <total_rep>220,515</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>xcode</tag3>
  </topusers>
  <topusers>
    <user>Psidom</user>
    <link>http://www.stackoverflow.com//users/4983450/psidom</link>
    <location>Atlanta, GA, United States</location>
    <year_rep>4,236</year_rep>
    <total_rep>36,950</total_rep>
    <tag1>python</tag1>
    <tag2>pandas</tag2>
    <tag3>r</tag3>
  </topusers>
  <topusers>
    <user>Martin R</user>
    <link>http://www.stackoverflow.com//users/1187415/martin-r</link>
    <location>Germany</location>
    <year_rep>4,195</year_rep>
    <total_rep>269,380</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>swift3</tag3>
  </topusers>
  <topusers>
    <user>Barmar</user>
    <link>http://www.stackoverflow.com//users/1491895/barmar</link>
    <location>Arlington, MA</location>
    <year_rep>4,179</year_rep>
    <total_rep>289,989</total_rep>
    <tag1>javascript</tag1>
    <tag2>php</tag2>
    <tag3>jquery</tag3>
  </topusers>
  <topusers>
    <user>Alexey Mezenin</user>
    <link>http://www.stackoverflow.com//users/1227923/alexey-mezenin</link>
    <location>??????</location>
    <year_rep>4,142</year_rep>
    <total_rep>31,602</total_rep>
    <tag1>laravel</tag1>
    <tag2>php</tag2>
    <tag3>laravel-5.3</tag3>
  </topusers>
  <topusers>
    <user>BalusC</user>
    <link>http://www.stackoverflow.com//users/157882/balusc</link>
    <location>Amsterdam, Netherlands</location>
    <year_rep>4,046</year_rep>
    <total_rep>703,046</total_rep>
    <tag1>java</tag1>
    <tag2>jsf</tag2>
    <tag3>servlets</tag3>
  </topusers>
  <topusers>
    <user>GurV</user>
    <link>http://www.stackoverflow.com//users/6348498/gurv</link>
    <location></location>
    <year_rep>4,016</year_rep>
    <total_rep>7,932</total_rep>
    <tag1>sql</tag1>
    <tag2>mysql</tag2>
    <tag3>sql-server</tag3>
  </topusers>
  <topusers>
    <user>Nina Scholz</user>
    <link>http://www.stackoverflow.com//users/1447675/nina-scholz</link>
    <location>Berlin, Deutschland</location>
    <year_rep>3,950</year_rep>
    <total_rep>61,135</total_rep>
    <tag1>javascript</tag1>
    <tag2>arrays</tag2>
    <tag3>object</tag3>
  </topusers>
  <topusers>
    <user>JB Nizet</user>
    <link>http://www.stackoverflow.com//users/571407/jb-nizet</link>
    <location>Saint-Etienne, France</location>
    <year_rep>3,923</year_rep>
    <total_rep>418,780</total_rep>
    <tag1>java</tag1>
    <tag2>hibernate</tag2>
    <tag3>java-8</tag3>
  </topusers>
  <topusers>
    <user>Frank van Puffelen</user>
    <link>http://www.stackoverflow.com//users/209103/frank-van-puffelen</link>
    <location>San Francisco, CA</location>
    <year_rep>3,920</year_rep>
    <total_rep>86,520</total_rep>
    <tag1>firebase</tag1>
    <tag2>firebase-database</tag2>
    <tag3>android</tag3>
  </topusers>
  <topusers>
    <user>dasblinkenlight</user>
    <link>http://www.stackoverflow.com//users/335858/dasblinkenlight</link>
    <location>United States</location>
    <year_rep>3,886</year_rep>
    <total_rep>475,813</total_rep>
    <tag1>c#</tag1>
    <tag2>java</tag2>
    <tag3>c++</tag3>
  </topusers>
  <topusers>
    <user>Tim Biegeleisen</user>
    <link>http://www.stackoverflow.com//users/1863229/tim-biegeleisen</link>
    <location>Singapore</location>
    <year_rep>3,814</year_rep>
    <total_rep>77,211</total_rep>
    <tag1>sql</tag1>
    <tag2>mysql</tag2>
    <tag3>java</tag3>
  </topusers>
  <topusers>
    <user>Greg Hewgill</user>
    <link>http://www.stackoverflow.com//users/893/greg-hewgill</link>
    <location>Christchurch, New Zealand</location>
    <year_rep>3,796</year_rep>
    <total_rep>529,137</total_rep>
    <tag1>git</tag1>
    <tag2>python</tag2>
    <tag3>git-pull</tag3>
  </topusers>
  <topusers>
    <user>unutbu</user>
    <link>http://www.stackoverflow.com//users/190597/unutbu</link>
    <location></location>
    <year_rep>3,735</year_rep>
    <total_rep>401,595</total_rep>
    <tag1>python</tag1>
    <tag2>pandas</tag2>
    <tag3>numpy</tag3>
  </topusers>
  <topusers>
    <user>Hans Passant</user>
    <link>http://www.stackoverflow.com//users/17034/hans-passant</link>
    <location>Madison, WI</location>
    <year_rep>3,688</year_rep>
    <total_rep>672,118</total_rep>
    <tag1>c#</tag1>
    <tag2>.net</tag2>
    <tag3>winforms</tag3>
  </topusers>
  <topusers>
    <user>Jonathan Leffler</user>
    <link>http://www.stackoverflow.com//users/15168/jonathan-leffler</link>
    <location>California, USA</location>
    <year_rep>3,649</year_rep>
    <total_rep>455,157</total_rep>
    <tag1>c</tag1>
    <tag2>bash</tag2>
    <tag3>unix</tag3>
  </topusers>
  <topusers>
    <user>paxdiablo</user>
    <link>http://www.stackoverflow.com//users/14860/paxdiablo</link>
    <location></location>
    <year_rep>3,636</year_rep>
    <total_rep>507,043</total_rep>
    <tag1>c</tag1>
    <tag2>c++</tag2>
    <tag3>bash</tag3>
  </topusers>
  <topusers>
    <user>Pranav C Balan</user>
    <link>http://www.stackoverflow.com//users/3037257/pranav-c-balan</link>
    <location>Ramanthali, Kannur, Kerala, India</location>
    <year_rep>3,604</year_rep>
    <total_rep>64,476</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>html</tag3>
  </topusers>
  <topusers>
    <user>Suragch</user>
    <link>http://www.stackoverflow.com//users/3681880/suragch</link>
    <location>Hohhot, China</location>
    <year_rep>3,580</year_rep>
    <total_rep>71,032</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>android</tag3>
  </topusers>
</stackoverflow>
Python Methods
import xml.etree.ElementTree as et
import pandas as pd
from io import StringIO
from lxml import etree as lxet

def read_xml_iterfind():
    tree = et.parse('Input.xml')

    data = []
    inner = {}
    for el in tree.iterfind('./*'):
        for i in el.iterfind('*'):
            inner[i.tag] = i.text
        data.append(inner)
        inner = {}

    df = pd.DataFrame(data)

def read_xml_iterparse():
    data = []
    inner = {}
    i = 1
    # 'Input.xml' replaces the undefined name `path`, matching the other methods
    for (ev, el) in et.iterparse('Input.xml'):
        if i <= 2:
            first_tag = el.tag
        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)
            inner = {}
        if el.text is not None and len(el.text.strip()) > 0:
            inner[el.tag] = el.text
        i += 1

    df = pd.DataFrame(data)

def read_xml_lxml_xpath():
    tree = lxet.parse('Input.xml')

    data = []
    inner = {}
    for el in tree.xpath('/*/*'):
        for i in el:
            inner[i.tag] = i.text
        data.append(inner)
        inner = {}

    df = pd.DataFrame(data)

def read_xml_lxml_xsl():
    xml = lxet.parse('Input.xml')

    xslstr = '''
    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output version="1.0" encoding="UTF-8" indent="yes" method="text"/>
    <xsl:strip-space elements="*"/>

      <!-- HEADERS -->
      <xsl:template match = "/*">
        <xsl:for-each select="*[1]/*">
          <xsl:value-of select="local-name()" />
          <xsl:choose>
            <xsl:when test="position() != last()">
              <xsl:text>,</xsl:text>
            </xsl:when>
            <xsl:otherwise>
              <xsl:text>&#xa;</xsl:text>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:for-each>
        <xsl:apply-templates/>
      </xsl:template>

      <!-- DATA ROWS (COMMA-SEPARATED) -->
      <xsl:template match="/*/*" priority="2">
        <xsl:for-each select="*">
          <xsl:if test="position() = 1">
            <xsl:text>"</xsl:text>
          </xsl:if>
          <xsl:value-of select="." />
          <xsl:choose>
            <xsl:when test="position() != last()">
              <xsl:text>","</xsl:text>
            </xsl:when>
            <xsl:otherwise>
              <xsl:text>"&#xa;</xsl:text>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:for-each>
      </xsl:template>

    </xsl:transform>
    '''
    xsl = lxet.fromstring(xslstr)

    transform = lxet.XSLT(xsl)
    newdom = transform(xml)

    df = pd.read_csv(StringIO(str(newdom)))
Timings (with current XML and XML with 25 times the children, i.e., 900 StackOverflow user records)
# SHORTER FILE
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 3.87 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 5.5 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 3.86 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 5.68 msec per loop

# LARGER FILE
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 36 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 78.9 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 32.7 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 51.4 msec per loop
-
westr over 4 years
As far as I know this is also discussed on the pandas GitHub. Maybe open an issue there?
-
Parfait over 4 years
Thank you for your answer. However, your reply appears to be general for Python and not specific to the XML methods proposed for pandas. Maybe a specific coding example, such as the JIT idea or Cython applied to the above reproducible example, could illustrate it better?
-
Jayen over 4 years
Maybe I didn't understand your questions? If something applies to all Python code, then it applies to your Python code. If you are looking for code examples, that was not clear from your question.