dask dataframe apply meta
meta
is the prescription of the names/types of the output from the computation. This is required because apply()
is flexible enough that it can produce just about anything from a dataframe. As you can see, if you don't provide a meta
, then dask actually computes part of the data, to see what the types should be - which is fine, but you should know it is happening.
You can avoid this pre-computation (which can be expensive) and be more explicit when you know what the output should look like, by providing a zero-row version of the output (dataframe or series), or just the types.
The output of your computation is actually a series, so the following is the simplest that works
(dask_df.groupby('Column B')
.apply(len, meta=('int'))).compute()
but more accurate would be
(dask_df.groupby('Column B')
.apply(len, meta=pd.Series(dtype='int', name='Column B')))
Related videos on Youtube
Matti Lyra
I am a PhD student at the University of Sussex studying Computational Linguistics. My research topic is "Topical Subcategory Structure in Text Classification"
Updated on October 15, 2022Comments
-
Matti Lyra over 1 year
I'm wanting to do a frequency count on a single column of a
dask
dataframe. The code works, but I get anwarning
complaining thatmeta
is not defined. If I try to definemeta
I get an errorAttributeError: 'DataFrame' object has no attribute 'name'
. For this particular use case it doesn't look like I need to definemeta
but I'd like to know how to do that for future reference.Dummy dataframe and the column frequencies
import pandas as pd from dask import dataframe as dd df = pd.DataFrame([['Sam', 'Alex', 'David', 'Sarah', 'Alice', 'Sam', 'Anna'], ['Sam', 'David', 'David', 'Alice', 'Sam', 'Alice', 'Sam'], [12, 10, 15, 23, 18, 20, 26]], index=['Column A', 'Column B', 'Column C']).T dask_df = dd.from_pandas(df)
In [39]: dask_df.head() Out[39]: Column A Column B Column C 0 Sam Sam 12 1 Alex David 10 2 David David 15 3 Sarah Alice 23 4 Alice Sam 18
(dask_df.groupby('Column B') .apply(lambda group: len(group)) ).compute() UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected. Before: .apply(func) After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result or: .apply(func, meta=('x', 'f8')) for series result warnings.warn(msg) Out[60]: Column B Alice 2 David 2 Sam 3 dtype: int64
Trying to define
meta
producesAttributeError
(dask_df.groupby('Column B') .apply(lambda d: len(d), meta={'Column B': 'int'})).compute()
same for this
(dask_df.groupby('Column B') .apply(lambda d: len(d), meta=pd.DataFrame({'Column B': 'int'}))).compute()
same if I try having the
dtype
beint
instead of"int"
or for that matter'f8'
ornp.float64
so it doesn't seem like it's thedtype
that is causing the problem.The documentation on
meta
seems to imply that I should be doing exactly what I'm trying to do (http://dask.pydata.org/en/latest/dataframe-design.html#metadata).What is
meta
? and how am I supposed to define it?Using
python 3.6
dask 0.14.3
andpandas 0.20.2
-
Bob Haffner about 7 yearsHmm, not sure why that would fail. Does this work
meta=('Column B', 'int')
? -
Matti Lyra about 7 yearsboth of those seem to do the right thing, no idea which one is the most effective
-
-
djakubosky over 5 yearsis there any performance boost to including the full pd.Series meta?
-
mdurant over 5 yearsNo, but it's more explicit, and in some cases allows you finer control, e.g., over the name and type of the index.