Dask Dataframe: Get row count?


Solution 1

import dask.dataframe

# ensure the block size is small enough for the graph to fit in your memory
ddf = dask.dataframe.read_csv('*.csv', blocksize="10MB")
ddf.shape[0].compute()

From the documentation:

blocksize <str, int or None> (optional): Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores, up to a maximum of 64MB. Can be a number like 64000000 or a string like "64MB". If None, a single block is used for each file.
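
For illustration, the three accepted forms look like this (a small sketch; the '*.csv' pattern is just a placeholder):

import dask.dataframe as dd

# an integer number of bytes...
ddf = dd.read_csv('*.csv', blocksize=64_000_000)

# ...or the equivalent human-readable string
ddf = dd.read_csv('*.csv', blocksize="64MB")

# None reads each file as a single block (one partition per file)
ddf = dd.read_csv('*.csv', blocksize=None)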

Solution 2

If you only need the number of rows, you can load just a subset of the columns, picking ones with low memory usage (such as category or integer columns rather than string/object ones), and thereafter run len(df.index), as sketched below.
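
A minimal sketch of that approach, assuming the CSVs contain an integer column named id (a hypothetical name; substitute any low-memory column from your own data):

import dask.dataframe as dd

# read only one cheap integer column; "id" is a hypothetical column name
ddf = dd.read_csv('*.csv', usecols=['id'], blocksize="10MB")

# len() triggers the computation: Dask counts rows per partition
# and sums the results, so the full dataset never has to fit in RAM at once
n_rows = len(ddf.index)
print(n_rows)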

Comments

  • usbToaster, about 2 years ago

    Simple question: I have a dataframe in Dask containing about 300 million records. I need to know the exact number of rows the dataframe contains. Is there an easy way to do this?

    When I try to run dataframe.x.count().compute(), it looks like it tries to load the entire dataset into RAM, for which there is no space, and it crashes.