Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?
Solution 1
First, create the derived value:
df.loc[0, 'C'] = df.loc[0, 'D']
Then iterate through the remaining rows and fill the calculated values:
for i in range(1, len(df)):
df.loc[i, 'C'] = df.loc[i-1, 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']
Index_Date A B C D
0 2015-01-31 10 10 10 10
1 2015-02-01 2 3 23 22
2 2015-02-02 10 60 290 280
Solution 2
Given a column of numbers:
lst = []
cols = ['A']
for a in range(100, 105):
lst.append([a])
df = pd.DataFrame(lst, columns=cols, index=range(5))
df
A
0 100
1 101
2 102
3 103
4 104
You can reference the previous row with shift:
df['Change'] = df.A - df.A.shift(1)
df
A Change
0 100 NaN
1 101 1.0
2 102 1.0
3 103 1.0
4 104 1.0
Solution 3
numba
For recursive calculations which are not vectorisable, numba
, which uses JIT-compilation and works with lower level objects, often yields large performance improvements. You need only define a regular for
loop and use the decorator @njit
or (for older versions) @jit(nopython=True)
:
For a reasonable size dataframe, this gives a ~30x performance improvement versus a regular for
loop:
from numba import jit
@jit(nopython=True)
def calculator_nb(a, b, d):
res = np.empty(d.shape)
res[0] = d[0]
for i in range(1, res.shape[0]):
res[i] = res[i-1] * a[i] + b[i]
return res
df['C'] = calculator_nb(*df[list('ABD')].values.T)
n = 10**5
df = pd.concat([df]*n, ignore_index=True)
# benchmarking on Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3, Numba 0.30.1
# calculator() is same as calculator_nb() but without @jit decorator
%timeit calculator_nb(*df[list('ABD')].values.T) # 14.1 ms per loop
%timeit calculator(*df[list('ABD')].values.T) # 444 ms per loop
Solution 4
Although it has been a while since this question was asked, I will post my answer hoping it helps somebody.
Disclaimer: I know this solution is not standard, but I think it works well.
import pandas as pd
import numpy as np
data = np.array([[10, 2, 10, 10],
[10, 3, 60, 100],
[np.nan] * 4,
[10, 22, 280, 250]]).T
idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df
A B C D
=================================
2015-01-31 10 10 NaN 10
2015-02-01 2 3 NaN 22
2015-02-02 10 60 NaN 280
2015-02-03 10 100 NaN 250
def calculate(mul, add):
global value
value = value * mul + add
return value
value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
df
A B C D
=================================
2015-01-31 10 10 10 10
2015-02-01 2 3 23 22
2015-02-02 10 60 290 280
2015-02-03 10 100 3000 250
So basically we use a apply
from pandas and the help of a global variable that keeps track of the previous calculated value.
Time comparison with a for
loop:
data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan
df.loc['2015-01-31', 'C'] = df.loc['2015-01-31', 'D']
%%timeit
for i in df.loc['2015-02-01':].index.date:
df.loc[i, 'C'] = df.loc[(i - pd.DateOffset(days=1)).date(), 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']
3.2 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan
def calculate(mul, add):
global value
value = value * mul + add
return value
value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
%%timeit
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
1.82 s ± 64.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So 0.57 times faster on average.
Solution 5
It's an old question but the solution below (without a for loop) might be helpful:
def new_fun(df):
prev_value = df.iloc[0]["C"]
def func2(row):
# non local variable ==> will use pre_value from the new_fun function
nonlocal prev_value
new_value = prev_value * row['A'] + row['B']
prev_value = row['C']
return new_value
# This line might throw a SettingWithCopyWarning warning
df.iloc[1:]["C"] = df.iloc[1:].apply(func2, axis=1)
return df
df = new_fun(df)
ctrl-alt-delete
Polyglot Programmer interested in Rust, Python and all things data.
Updated on July 08, 2022Comments
-
ctrl-alt-delete almost 2 years
I have the following dataframe:
Index_Date A B C D ================================ 2015-01-31 10 10 Nan 10 2015-02-01 2 3 Nan 22 2015-02-02 10 60 Nan 280 2015-02-03 10 100 Nan 250
Require:
Index_Date A B C D ================================ 2015-01-31 10 10 10 10 2015-02-01 2 3 23 22 2015-02-02 10 60 290 280 2015-02-03 10 100 3000 250
Column C
is derived for2015-01-31
by takingvalue
ofD
.Then I need to use the
value
ofC
for2015-01-31
and multiply by thevalue
ofA
on2015-02-01
and addB
.I have attempted an
apply
and ashift
using anif else
by this gives a key error. -
ctrl-alt-delete over 8 yearsis there a function in pandas to do this without the loop?
-
Stefan over 8 yearsThe iterative nature of the calculation where the inputs depend on results of previous steps complicates vectorization. You could perhaps use
apply
with a function that does the same calculation as the loop, but behind the scenes this would also be a loop. pandas.pydata.org/pandas-docs/version/0.17.1/generated/… -
ctrl-alt-delete over 8 yearsIf I use this loop and calculate on a merged dataframe and it finds a Nan it works but only to the row with Nan. No errors are thrown, If I try a fillNa I get AttributeError: 'numpy.float64' object has no attribute 'fillna' Is there any way to skip the row with Nan or set values to zero?
-
Stefan over 8 yearsDo you mean missing values in columns other than
C
? -
ctrl-alt-delete over 8 yearsYes your solution is fine. I just ensure I fill the Nans in the dataframe before the loop.
-
Bill over 5 yearsThis won't help in this situation because the value from the previous row is not known at the beginning. It has to be computed each iteration and then used in the next iteration.
-
Kevin Pauli about 4 yearsI still am grateful for this answer because I stumbled across this, looking for a case where I do know the value from the previous row. So thanks @kztd
-
Artem Malikov about 4 yearsIt is wonderful! I have accelerated my function, which counts values from previous values. Thanks!
-
Sodanetworks almost 4 yearsI'm trying to implement this solution, however, this only applies to dataframes with sequential index (0,1,2,3). My DataFrame has no sequential index (0,3,4,9). Any ideas how to handle it? ..After I post this question, I found that this can be solved by iloc instead of loc.
-
Bryan Butler over 3 yearsThis is a great solution that I needed for doing a mortgage calculation table. I was thinking of a shift as below, but this works very well, and with mortgages the max amount of rows is 360, so speed will never be an issue.
-
luney over 3 years@Sodanetworks , I got also issues with no sequential index. I couldn't just replace loc by iloc. I finally ended by regenerating the index using df.reset_index(drop=True, inplace=True)
-
sergzemsk over 3 yearsHow can I use
@jit(nopython=True)
in jupyter-notebook? -
jpp over 3 years@sergzemsk, Just as you've written it (and in my answer), it's called a decorator. Note later versions of numba support the shortcut
@njit
. -
sergzemsk over 3 years@jpp i have
if
condition so this improvement failed. I got an error "TypingError: Failed in nopython mode pipeline (step: nopython frontend)" -
jpp over 3 years@sergzemsk, I suggest you ask a new question, not clear to me where the
if
statement sits, why it's not being vectorised by numba. -
feetwet about 3 yearsThis makes some assumptions about
.apply
that may not be true: If.apply
is parallelized or called in anything other than the order you expect the results will not be as expected. -
Wazaa about 3 yearsI agree with your concerns. The assumptions in this anwser are based on the question of this thread. Also, apply isn't parallelized by default ...
-
logicOnAbstractions over 2 years@Stefan it is true that behind the scene if you use apply(func....) you'd still have a loop. However I'd argue that this makes the code much cleaner, easier to read & easier to maintain, especially if one has a bunch of columns to compute this way.
-
Dimanjan over 2 yearsExactly what I was looking for. This also works faster because it has array operation instead of looping as suggested on other answers.