Iteratively writing to HDF5 Stores in Pandas
Solution 1
The data is written as soon as the statement is executed, e.g.
store['df'] = df
The close call just closes the actual file (it will be closed for you if the process exits, but a warning message will be printed).
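A minimal sketch of the point above, assuming PyTables is installed and using a hypothetical temporary path: the assignment issues the write, and wrapping the store in a `with` block takes care of closing the file so no warning is printed at exit. The `flush()` call is optional and only shown to make the "push it to the file now" step explicit.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Hypothetical scratch location so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "store.h5")

# The context manager closes the file on exit, even if an error occurs.
with pd.HDFStore(path) as store:
    store["df"] = df  # the write is issued by this statement
    store.flush()     # optional: ask for buffered data to be pushed to the file

# The store is closed here; read the node back to check what was written.
loaded = pd.read_hdf(path, "df")
print(loaded)
```

Nothing has to be kept in memory after the `with` block; the frame is read back from the file.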
Read the section http://pandas.pydata.org/pandas-docs/dev/io.html#storing-in-table-format
It is generally not a good idea to put a LOT of nodes in an .h5 file. You probably want to append and create a smaller number of nodes. You can just iterate through your .csv files and store/append them one by one. Something like:
for f in files:
    df = pd.read_csv(f)
    df.to_hdf('file.h5', key=f)
would be one way (creating a separate node for each file).
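The other option mentioned above, appending everything to a single node, could look roughly like the sketch below. The CSV files, the column names `x` and `y`, the node name `data`, and the tiny `chunksize` are all made up for illustration; the idea is only that `read_csv` with `chunksize` yields one piece at a time, so no file has to fit in memory as a whole.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build a few small CSV files to stand in for the real inputs (hypothetical data).
workdir = tempfile.mkdtemp()
csv_paths = []
for i in range(3):
    csv_path = os.path.join(workdir, f"part_{i}.csv")
    frame = pd.DataFrame({"x": np.arange(10) + 10 * i, "y": np.random.rand(10)})
    frame.to_csv(csv_path, index=False)
    csv_paths.append(csv_path)

h5_path = os.path.join(workdir, "file.h5")

with pd.HDFStore(h5_path, mode="w") as store:
    for csv_path in csv_paths:
        # chunksize makes read_csv yield DataFrames of at most that many rows.
        for chunk in pd.read_csv(csv_path, chunksize=4):
            # append() creates the table node on first use and extends it afterwards.
            store.append("data", chunk, index=False)

combined = pd.read_hdf(h5_path, "data")
print(len(combined))
```

Every chunk needs the same columns and dtypes as the first one, since the first append fixes the layout of the table.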
Not appendable: once you write it, you can only retrieve it all at once, i.e. you cannot select a sub-section.
If you have a table, then you can do things like:
pd.read_hdf('my_store.h5', 'a_table_node', where='index>100')
which is like a database query, only getting part of the data
Thus, a store is not appendable, nor queryable, while a table is both.
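To make the difference concrete, here is a small sketch that writes the same frame once in fixed format and once in table format. The file path, the node name `a_fixed_node` and the `value` column are assumptions for the example; `a_table_node` follows the name used above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(200, dtype="float64")})
path = os.path.join(tempfile.mkdtemp(), "my_store.h5")

# Same data, two storage formats in the same file.
df.to_hdf(path, key="a_fixed_node", format="fixed")
df.to_hdf(path, key="a_table_node", format="table")

# A table node accepts a where clause and returns only the matching rows.
subset = pd.read_hdf(path, "a_table_node", where="index > 100")
print(len(subset))

# A table node can also be extended later.
more = pd.DataFrame({"value": [200.0, 201.0]}, index=[200, 201])
more.to_hdf(path, key="a_table_node", format="table", append=True)
total_rows = len(pd.read_hdf(path, "a_table_node"))

# The fixed node rejects the same query; it can only be read in full.
try:
    pd.read_hdf(path, "a_fixed_node", where="index > 100")
    fixed_is_queryable = True
except (TypeError, ValueError):
    fixed_is_queryable = False
print(fixed_is_queryable)
```

The fixed format is typically faster to write and read, so it is still a reasonable choice when a node is always loaded whole.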
Solution 2
Answering question 2, with pandas 0.18.0 you can do:
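import pandas as pd

store = pd.HDFStore('compiled_measurements.h5')
# file_iterator: any iterable of paths to the .csv files
for filepath in file_iterator:
    raw = pd.read_csv(filepath)
    # index=False defers indexing; newer pandas needs data_columns to index a, b, c
    store.append('measurements', raw, index=False, data_columns=['a', 'b', 'c'])
# build the index once, after all the data has been appended
store.create_table_index('measurements', columns=['a', 'b', 'c'], optlevel=9, kind='full')
store.close()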
Based on this part of the docs.
Depending on how much data you have, the index creation can consume enormous amounts of memory. The PyTables docs describe the values of optlevel.
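As a rough, self-contained check of the deferred-index approach, the sketch below uses small in-memory frames instead of CSV files. The data and the choice to index only column `a` are assumptions for the example. It builds the index once at the end and then runs a query against the indexed column.

```python
import os
import tempfile

import numpy as np
import pandas as pd

h5_path = os.path.join(tempfile.mkdtemp(), "compiled_measurements.h5")
rng = np.random.default_rng(0)

with pd.HDFStore(h5_path, mode="w") as store:
    for i in range(3):
        # Stand-in for pd.read_csv(filepath): one small batch of rows per pass.
        raw = pd.DataFrame({
            "a": np.arange(5) + 5 * i,
            "b": rng.random(5),
            "c": rng.random(5),
        })
        # index=False skips indexing on each append; data_columns makes "a" queryable.
        store.append("measurements", raw, index=False, data_columns=["a"])

    # Build the index once, after all rows are in place.
    store.create_table_index("measurements", columns=["a"], optlevel=9, kind="full")

    # Inspect the underlying PyTables column to confirm the index exists.
    a_is_indexed = store.get_storer("measurements").table.cols.a.is_indexed

    # Query on the indexed data column.
    selected = store.select("measurements", where="a >= 10")

print(a_is_indexed, len(selected))
```

Lower `optlevel` values or a lighter `kind` trade query speed for less time and memory during index creation, which may matter on large tables.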
Amelio Vazquez-Reina
Updated on July 09, 2022
Pandas has the following examples for how to store Series, DataFrames and Panels in HDF5 files:
Prepare some data:
In [1142]: store = HDFStore('store.h5')
In [1143]: index = date_range('1/1/2000', periods=8)
In [1144]: s = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
In [1145]: df = DataFrame(randn(8, 3), index=index,
   ......:                columns=['A', 'B', 'C'])
   ......:
In [1146]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
   ......:            major_axis=date_range('1/1/2000', periods=5),
   ......:            minor_axis=['A', 'B', 'C', 'D'])
   ......:
Save it in a store:
In [1147]: store['s'] = s
In [1148]: store['df'] = df
In [1149]: store['wp'] = wp
Inspect what's in the store:
In [1150]: store
Out[1150]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df            frame        (shape->[8,3])
/s             series       (shape->[5])
/wp            wide         (shape->[2,5,4])
Close the store:
In [1151]: store.close()
Questions:
1. In the code above, when is the data actually written to disk?
2. Say I want to add thousands of large dataframes living in .csv files to a single .h5 file. I would need to load them and add them to the .h5 file one by one, since I cannot afford to have them all in memory at once. Is this possible with HDF5? What would be the correct way to do it?
3. The Pandas documentation says the following:
"These stores are not appendable once written (though you simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety."
What does it mean by not appendable nor queryable? Also, shouldn't it say once closed instead of written?