How does one append large amounts of data to a Pandas HDFStore and get a natural unique index?

You can do it like this. The only trick is that the first time through, the table doesn't exist in the store yet, so get_storer will raise.

import pandas as pd
import numpy as np
import os

files = ['test1.csv','test2.csv']
for f in files:
    pd.DataFrame(np.random.randn(10,2),columns=list('AB')).to_csv(f)

path = 'test.h5'
if os.path.exists(path):
    os.remove(path)

with pd.HDFStore(path) as store:
    for f in files:
        df = pd.read_csv(f, index_col=0)
        try:
            # how many rows are already stored under this key
            nrows = store.get_storer('foo').nrows
        except (AttributeError, KeyError):
            # the table doesn't exist yet on the first pass
            nrows = 0

        # shift the incoming index so it continues where the store left off
        df.index = pd.Series(df.index) + nrows
        store.append('foo', df)


In [10]: pd.read_hdf('test.h5','foo')
Out[10]: 
           A         B
0   0.772017  0.153381
1   0.304131  0.368573
2   0.995465  0.799655
3  -0.326959  0.923280
4  -0.808376  0.449645
5  -1.336166  0.236968
6  -0.593523 -0.359080
7  -0.098482  0.037183
8   0.315627 -1.027162
9  -1.084545 -1.922288
10  0.412407 -0.270916
11  1.835381 -0.737411
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798
15  1.181344  0.354411
16  0.501892 -0.358361
17  0.633256  0.419397
18  0.932354 -0.603932
19 -0.341135  2.453220

You don't actually need a globally unique index (unless you want one), as HDFStore (through PyTables) numbers the rows uniquely for you. You can always add these selection parameters.

In [11]: pd.read_hdf('test.h5','foo',start=12,stop=15)
Out[11]: 
           A         B
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798
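
The same where-based "advanced queries" work here too (this is what the I/O docs section linked in the comments refers to); a quick sketch against the table created above, selecting on the now-unique index:

In [12]: pd.read_hdf('test.h5','foo',where='index >= 12 & index < 15')
Out[12]:
           A         B
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798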


Comments

  • Ben Scherrey
    Ben Scherrey almost 2 years

    I'm importing large amounts of HTTP logs (80GB+) into a Pandas HDFStore for statistical processing. Even within a single import file I need to batch the content as I load it. My tactic thus far has been to read the parsed lines into a DataFrame and then store the DataFrame into the HDFStore. My goal is to have the index unique for a single key in the HDFStore, but each DataFrame restarts its own index values again. I was anticipating that HDFStore.append() would have some mechanism to tell it to ignore the DataFrame's index values and just keep adding to my HDFStore key's existing index values, but I cannot seem to find it. How do I import DataFrames, ignoring the index values contained therein, while having the HDFStore increment its existing index values? The sample code below batches every 10 lines. Naturally the real thing would be larger.

    if hd_file_name:
            """
            HDF5 output file specified.
            """
    
            hdf_output = pd.HDFStore(hd_file_name, complib='blosc')
            print hdf_output
    
            columns = ['source', 'ip', 'unknown', 'user', 'timestamp', 'http_verb', 'path', 'protocol', 'http_result', 
                       'response_size', 'referrer', 'user_agent', 'response_time']
    
            source_name = str(log_file.name.rsplit('/')[-1])   # HDF5 Tables don't play nice with unicode so explicit str(). :(
    
            batch = []
    
            for count, line in enumerate(log_file,1):
                data = parse_line(line, rejected_output = reject_output)
    
                # Add our source file name to the beginning.
                data.insert(0, source_name )    
                batch.append(data)
    
                if not (count % 10):
                    df = pd.DataFrame( batch, columns = columns )
                    hdf_output.append(KEY_NAME, df)
                    batch = []
    
            if (count % 10):
                df = pd.DataFrame( batch, columns = columns )
                hdf_output.append(KEY_NAME, df)
    
  • Carst
    Carst over 10 years
    Great post; there are cases (like this one) where it would be ideal to have an auto-incrementing index instead of this workaround, though. As in: when you save to HDF5, you could pass ignore_index=True (like pd.concat) and it would do this automatically under the hood? Just an idea (the running-row-counter sketch after these comments emulates this).
  • Carst
    Carst over 10 years
    Also: when using a large number of small files, it would be better to get the current number of rows once and then bump nrows by the length of the current DataFrame at the end of each append (much better for performance). If this is a vague comment, let me know and I will work it out somewhere (see the running-row-counter sketch after these comments).
  • Jeff
    Jeff over 10 years
    You don't really need to do either of these. As the rows are numbered internally by the store, you can select by row number if you want to (see pandas.pydata.org/pandas-docs/dev/io.html#advanced-queries).
  • Carst
    Carst over 10 years
    In the use case where I would have a separate store or node per file, right? The thing is, I have 1.2 billion records that I have to select from dynamically. If I spread it over various stores/nodes, I would either a) build a metadata model that remembers which data is in which store, or b) go through every store/node for every query. So preferably I would like to have it all in one big table. But I'm running into some issues (even with index=False, the incremental appending gets slower and slower after 200+ million rows).
  • Jeff
    Jeff over 10 years
    Ahh... yes, if you split over multiple files that would be true. When you append LOTS of rows, definitely turn off indexing (index=False), ptrepack occasionally, and then build the index at the very end (see the index=False sketch after these comments).
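
To make Carst's two suggestions above concrete, here is a minimal sketch (the file name, key, and batch loop are made up for illustration): keep a running row counter in Python and bump it by the length of each batch rather than calling get_storer on every append, and overwrite each incoming frame's index from that counter, which behaves like the ignore_index=True option described above.

import pandas as pd
import numpy as np

path, key = 'big.h5', 'foo'

with pd.HDFStore(path, mode='w', complib='blosc') as store:
    nrows = 0                                  # running total, maintained locally
    for batch in range(5):                     # stand-in for "many small files"
        df = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
        # ignore the frame's own index; continue numbering from the running total
        df.index = pd.RangeIndex(nrows, nrows + len(df))
        store.append(key, df)
        nrows += len(df)                       # no get_storer() round-trip per batch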
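
And a sketch of the bulk-loading pattern Jeff describes in the last comment, again with illustrative file and key names: append with index=False so no PyTables index is maintained per batch (ptrepack can be run on the file occasionally to defragment it), then build the index once at the very end.

import pandas as pd
import numpy as np

with pd.HDFStore('logs.h5', mode='w', complib='blosc') as store:
    for batch in range(100):
        df = pd.DataFrame(np.random.randn(10000, 2), columns=list('AB'))
        store.append('logs', df, index=False)  # skip per-append index maintenance

    # create the on-disk index once, after all the data has been appended
    store.create_table_index('logs', optlevel=9, kind='full')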