Python DataFrames For Loop with If Statement not working


I think it is better to use numpy.where:

mask = ES_15M_Summary['Rolling_OLS_Coefficient'] > .08
ES_15M_Summary['Long'] = np.where(mask, 'Y', 'N')

Sample:

import numpy as np
import pandas as pd

ES_15M_Summary = pd.DataFrame({'Rolling_OLS_Coefficient':[0.07,0.01,0.09]})
print (ES_15M_Summary)
   Rolling_OLS_Coefficient
0                     0.07
1                     0.01
2                     0.09

mask = ES_15M_Summary['Rolling_OLS_Coefficient'] > .08
ES_15M_Summary['Long'] = np.where(mask, 'Y', 'N')
print (ES_15M_Summary)
   Rolling_OLS_Coefficient Long
0                     0.07    N
1                     0.01    N
2                     0.09    Y
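
The question says a missing value would work as well as 'N'. A minimal sketch of that variant, assuming the same sample data; the helper name all_yes is made up for illustration:

```python
import pandas as pd

# Same sample data as above
ES_15M_Summary = pd.DataFrame({'Rolling_OLS_Coefficient': [0.07, 0.01, 0.09]})
mask = ES_15M_Summary['Rolling_OLS_Coefficient'] > .08

# Start from a column that is 'Y' everywhere (all_yes is a hypothetical name),
# then keep 'Y' only where the mask is True. Series.where fills the other
# rows with a missing value by default.
all_yes = pd.Series('Y', index=ES_15M_Summary.index)
ES_15M_Summary['Long'] = all_yes.where(mask)
print(ES_15M_Summary)
```

Rows that fail the condition can then be found with ES_15M_Summary['Long'].isna().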

A looping solution also works, but it is very slow:

for index, row in ES_15M_Summary.iterrows():
    if ES_15M_Summary.loc[index, 'Rolling_OLS_Coefficient'] > .08:
        ES_15M_Summary.loc[index,'Long'] = 'Y'
    else:
        ES_15M_Summary.loc[index,'Long'] = 'N'
print (ES_15M_Summary)
   Rolling_OLS_Coefficient Long
0                     0.07    N
1                     0.01    N
2                     0.09    Y
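
For reference, the loop in the question fails because ES_15M_Summary['Rolling_OLS_Coefficient'] > .08 compares the whole column at once and returns a Series of booleans, which an if statement cannot reduce to a single True or False. A small sketch that reproduces this, assuming the same sample data (error_message and row_flags are names chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Rolling_OLS_Coefficient': [0.07, 0.01, 0.09]})

# Comparing the whole column yields one boolean per row, not a single bool
comparison = df['Rolling_OLS_Coefficient'] > .08

try:
    if comparison:  # raises ValueError: truth value of a Series is ambiguous
        pass
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# Comparing one row's value at a time gives a single boolean per iteration
row_flags = [bool(row['Rolling_OLS_Coefficient'] > .08) for _, row in df.iterrows()]
print(row_flags)
```

This is why the working loop indexes a single cell with .loc[index, ...] before comparing.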

Timings:

#3000 rows
ES_15M_Summary = pd.DataFrame({'Rolling_OLS_Coefficient':[0.07,0.01,0.09] * 1000})
#print (ES_15M_Summary)


def loop(df):
    # Use the df argument rather than the global DataFrame
    for index, row in df.iterrows():
        if df.loc[index, 'Rolling_OLS_Coefficient'] > .08:
            df.loc[index,'Long'] = 'Y'
        else:
            df.loc[index,'Long'] = 'N'
    return df

print (loop(ES_15M_Summary))


In [51]: %timeit (loop(ES_15M_Summary))
1 loop, best of 3: 2.38 s per loop

In [52]: %timeit ES_15M_Summary['Long'] = np.where(ES_15M_Summary['Rolling_OLS_Coefficient'] > .08, 'Y', 'N')
1000 loops, best of 3: 555 µs per loop
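
Outside IPython, a similar comparison could be made with the standard timeit module. This is only a sketch: the function names with_where and with_comprehension are hypothetical, the list comprehension is an extra variant not timed above, and the numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# 3000 rows, as in the timings above
df = pd.DataFrame({'Rolling_OLS_Coefficient': [0.07, 0.01, 0.09] * 1000})

def with_where(frame):
    # Vectorized: one comparison over the whole column
    return np.where(frame['Rolling_OLS_Coefficient'] > .08, 'Y', 'N')

def with_comprehension(frame):
    # Plain Python loop over the column values, without per-cell .loc access
    return ['Y' if value > .08 else 'N'
            for value in frame['Rolling_OLS_Coefficient']]

where_seconds = timeit.timeit(lambda: with_where(df), number=100)
comprehension_seconds = timeit.timeit(lambda: with_comprehension(df), number=100)
print('np.where:      %.4f s for 100 runs' % where_seconds)
print('comprehension: %.4f s for 100 runs' % comprehension_seconds)
```

Either result can be assigned to the new column, for example df['Long'] = with_where(df).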
Author: Cole Starbuck

Updated on December 02, 2022

Comments

  • Cole Starbuck, over 1 year

    I have a DataFrame called ES_15M_Summary, with coefficients/betas in one column, ES_15M_Summary['Rolling_OLS_Coefficient'], as follows:

    Column 'Rolling_OLS_Coefficient'

    If a value in the column pictured above ('Rolling_OLS_Coefficient') is greater than .08, I want a new column titled 'Long' to hold 'Y' for that row. If the value is less than .08, I want 'Long' to be 'NaN' or just 'N' (either works).

    So I'm writing a for loop to run down the columns. First, I created a new column titled 'Long' and set it to NaN:

    ES_15M_Summary['Long'] = np.nan
    

    Then I made the following For Loop:

    for index, row in ES_15M_Summary.iterrows():
        if ES_15M_Summary['Rolling_OLS_Coefficient'] > .08:
            ES_15M_Summary['Long'] = 'Y'
        else:
            ES_15M_Summary['Long'] = 'NaN'
    

    I get the error:

    ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). 
    

    ...referring to the if statement line shown above (if...>.08:). I'm not sure why I'm getting this error or what's wrong with the for loop. Any help is appreciated.

  • Cole Starbuck, about 7 years
    Thank you, I'm using the for loop you provided. Much appreciated.