Dask Dataframe: Get row count?
Solution 1
import dask.dataframe as dd

# keep the blocksize small enough that each partition fits comfortably in memory
ddf = dd.read_csv('*.csv', blocksize="10MB")
ddf.shape[0].compute()
From the documentation:
blocksize <str, int or None> (optional) — Number of bytes by which to cut up larger files. The default value is computed based on available physical memory and the number of cores, up to a maximum of 64MB. Can be a number like 64000000 or a string like "64MB". If None, a single block is used for each file.
Solution 2
If you only need the number of rows -
you can load a subset of the columns, picking ones with low memory usage (such as category or integer columns rather than string/object columns), and then run len(df.index)
usbToaster
Updated on June 07, 2022

Comments:
usbToaster, about 2 years ago
Simple question: I have a Dask dataframe containing about 300 million records. I need to know the exact number of rows it contains. Is there an easy way to do this?
When I try to run
dataframe.x.count().compute()
it looks like it tries to load the entire dataset into RAM, for which there is no space, and it crashes.
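One subtlety worth noting: `.count()` on a column counts only non-null values, which is not necessarily the row count. The same semantics hold in plain pandas, shown here on a tiny Series:

```python
import pandas as pd

# count() excludes nulls; len() gives the number of rows.
s = pd.Series([1.0, None, 3.0])
non_null = s.count()
n_rows = len(s)
print(non_null, n_rows)
```

So even when it does run, `dataframe.x.count().compute()` answers "how many non-null values in x", while `len(dataframe)` (or `ddf.shape[0].compute()`) answers "how many rows".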