Assign (add) a new column to a dask dataframe based on values of 2 existing columns - involves a conditional statement

17,974

Solution 1

You can either use fillna (fast) or you can use apply (slow but flexible)

Fillna

import pandas as pd

import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)

ddf['z'] = ddf.y.fillna((100 + ddf.x))

>>> df

   x      y
0  1  0.200
1  2    NaN
2  3  0.345
3  4  0.400
4  5  0.150

>>> ddf.compute()

   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150

Of course in this case though because your function uses y if y is a null, the result will be null as well. I'm assuming that you didn't intend this, so I changed the output slightly.

Use apply

As any Pandas expert will tell you, using apply comes with a 10x to 100x slowdown penalty. Please beware.

That being said, the flexibility is useful. Your example almost works, except that you are providing improper metadata. You are telling apply that the function produces a dataframe, when in fact I think that your function was intended to produce a series. You can have Dask guess the meta information for you (although it will complain) or you can specify the dtype explicitly. Both options are shown in the example below:

In [1]: import pandas as pd
   ...: 
   ...: import dask.dataframe as dd
   ...: df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
   ...: ddf = dd.from_pandas(df, npartitions=2)
   ...: 

In [2]: def func(row):
   ...:     if pd.isnull(row['y']):
   ...:         return row['x'] + 100
   ...:     else:
   ...:         return row['y']
   ...:     

In [3]: ddf['z'] = ddf.apply(func, axis=1)
/home/mrocklin/Software/anaconda/lib/python3.4/site-packages/dask/dataframe/core.py:2553: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  warnings.warn(msg)

In [4]: ddf.compute()
Out[4]: 
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150

In [5]: ddf['z'] = ddf.apply(func, axis=1, meta=float)

In [6]: ddf.compute()
Out[6]: 
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150

Solution 2

I do not have any experience with dask but your boolean test will not catch that 2nd element as null in funcUpdate. Null values with pandas are equal to None or NaN/Nan, not "".

def funcUpdate(row):
    try:
        return  round((1 + row['x'])/(1+ 1/row['y']),4)
    except:
        return row['y']

Is a possible workaround but you would need to run data validation before hand.

Share:
17,974
ML_Passion
Author by

ML_Passion

Passionate about these things in life: -- Game Theory -- Machine Learning -- Writing elegant/beautiful code -- Hiking in the mountains

Updated on July 27, 2022

Comments

  • ML_Passion
    ML_Passion almost 2 years

    I would like to add a new column to an existing dask dataframe based on the values of the 2 existing columns and involves a conditional statement for checking nulls:

    DataFrame definition

    import pandas as pd
    import dask.dataframe as dd
    
    df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, "", 0.345, 0.40, 0.15]})
    ddf = dd.from_pandas(df1, npartitions=2)
    

    Method-1 tried

    def funcUpdate(row):
        if row['y'].isnull():
            return row['y']
        else:
            return  round((1 + row['x'])/(1+ 1/row['y']),4)
    
    ddf = ddf.assign(z= ddf.apply(funcUpdate, axis=1 , meta = ddf))
    

    It gives an error:

    TypeError: Column assignment doesn't support type DataFrame
    

    Method-2

    ddf = ddf.assign(z = ddf.apply(lambda col: col.y if col.y.isnull() else  round((1 + col.x)/(1+ 1/col.y),4),axis = 1, meta = ddf))
    

    Any idea how it should be done ?

  • ML_Passion
    ML_Passion over 7 years
    you are right I didn't the y to be null. Actually i was defining the Null value incorrectly and that was causing the error too. Thanks for your explaination, helped a lot. Although i am not sure the fastna will help or not because the actual function is like this: def func(row): if pd.isnull(row['y']): return row['x'] else: return round((1 + row['x'])/(1 + 1/row['y']),4) Basically if y is null, the new column z = x , otherwise the new column z is equal to that calculation in the return statement.
  • ML_Passion
    ML_Passion over 7 years
    Another question is why can't isnull be applied to a column of a dask dataframe , using the pandas syntax was kind of surprising to me: if pd.isnull(row['y']):
  • ML_Passion
    ML_Passion over 7 years
    how can I apply .assign in this case ?
  • MRocklin
    MRocklin over 7 years
    I use pd.isnull(row['y']) because row['y'] is a float and doesn't have the pandas methods. ddf['z'] = foo is equivalent to ddf = ddf.assign(z=foo) for any foo.
  • ML_Passion
    ML_Passion over 7 years
    I think of one of the issues in my dataframe definition was that I defined Null as "" , where as it should be defined as "None"