Assign (add) a new column to a dask dataframe based on values of 2 existing columns - involves a conditional statement
Solution 1
You can either use fillna
(fast) or you can use apply
(slow but flexible)
Fillna
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)
ddf['z'] = ddf.y.fillna((100 + ddf.x))
>>> df
x y
0 1 0.200
1 2 NaN
2 3 0.345
3 4 0.400
4 5 0.150
>>> ddf.compute()
x y z
0 1 0.200 0.200
1 2 NaN 102.000
2 3 0.345 0.345
3 4 0.400 0.400
4 5 0.150 0.150
Of course in this case though because your function uses y
if y
is a null, the result will be null as well. I'm assuming that you didn't intend this, so I changed the output slightly.
Use apply
As any Pandas expert will tell you, using apply
comes with a 10x to 100x slowdown penalty. Please beware.
That being said, the flexibility is useful. Your example almost works, except that you are providing improper metadata. You are telling apply that the function produces a dataframe, when in fact I think that your function was intended to produce a series. You can have Dask guess the meta information for you (although it will complain) or you can specify the dtype explicitly. Both options are shown in the example below:
In [1]: import pandas as pd
...:
...: import dask.dataframe as dd
...: df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
...: ddf = dd.from_pandas(df, npartitions=2)
...:
In [2]: def func(row):
...: if pd.isnull(row['y']):
...: return row['x'] + 100
...: else:
...: return row['y']
...:
In [3]: ddf['z'] = ddf.apply(func, axis=1)
/home/mrocklin/Software/anaconda/lib/python3.4/site-packages/dask/dataframe/core.py:2553: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
warnings.warn(msg)
In [4]: ddf.compute()
Out[4]:
x y z
0 1 0.200 0.200
1 2 NaN 102.000
2 3 0.345 0.345
3 4 0.400 0.400
4 5 0.150 0.150
In [5]: ddf['z'] = ddf.apply(func, axis=1, meta=float)
In [6]: ddf.compute()
Out[6]:
x y z
0 1 0.200 0.200
1 2 NaN 102.000
2 3 0.345 0.345
3 4 0.400 0.400
4 5 0.150 0.150
Solution 2
I do not have any experience with dask but your boolean test will not catch that 2nd element as null in funcUpdate. Null values with pandas are equal to None or NaN/Nan, not "".
def funcUpdate(row):
try:
return round((1 + row['x'])/(1+ 1/row['y']),4)
except:
return row['y']
Is a possible workaround but you would need to run data validation before hand.
ML_Passion
Passionate about these things in life: -- Game Theory -- Machine Learning -- Writing elegant/beautiful code -- Hiking in the mountains
Updated on July 27, 2022Comments
-
ML_Passion almost 2 years
I would like to add a new column to an existing dask dataframe based on the values of the 2 existing columns and involves a conditional statement for checking nulls:
DataFrame definition
import pandas as pd import dask.dataframe as dd df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, "", 0.345, 0.40, 0.15]}) ddf = dd.from_pandas(df1, npartitions=2)
Method-1 tried
def funcUpdate(row): if row['y'].isnull(): return row['y'] else: return round((1 + row['x'])/(1+ 1/row['y']),4) ddf = ddf.assign(z= ddf.apply(funcUpdate, axis=1 , meta = ddf))
It gives an error:
TypeError: Column assignment doesn't support type DataFrame
Method-2
ddf = ddf.assign(z = ddf.apply(lambda col: col.y if col.y.isnull() else round((1 + col.x)/(1+ 1/col.y),4),axis = 1, meta = ddf))
Any idea how it should be done ?
-
ML_Passion over 7 yearsyou are right I didn't the y to be null. Actually i was defining the Null value incorrectly and that was causing the error too. Thanks for your explaination, helped a lot. Although i am not sure the fastna will help or not because the actual function is like this:
def func(row): if pd.isnull(row['y']): return row['x'] else: return round((1 + row['x'])/(1 + 1/row['y']),4)
Basically if y is null, the new column z = x , otherwise the new column z is equal to that calculation in the return statement. -
ML_Passion over 7 yearsAnother question is why can't isnull be applied to a column of a dask dataframe , using the pandas syntax was kind of surprising to me:
if pd.isnull(row['y']):
-
ML_Passion over 7 yearshow can I apply .assign in this case ?
-
MRocklin over 7 yearsI use
pd.isnull(row['y'])
becauserow['y']
is a float and doesn't have the pandas methods.ddf['z'] = foo
is equivalent toddf = ddf.assign(z=foo)
for anyfoo
. -
ML_Passion over 7 yearsI think of one of the issues in my dataframe definition was that I defined Null as "" , where as it should be defined as "None"