Pandas "diff()" with string

Solution 1

I get better performance with ne instead of using the actual != comparison:

df['changed'] = df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)
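
As a self-contained illustration of the line above, here is a minimal sketch that rebuilds the sample frame from the question and applies the same expression step by step. The intermediate name previous is only for illustration and is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})

# shift() moves every value down one row; bfill() fills the empty first
# slot with the first real value, so row 0 is compared with itself.
previous = df["ColumnB"].shift().bfill()

# ne() is the method form of !=; astype(int) turns True/False into 1/0.
df["changed"] = df["ColumnB"].ne(previous).astype(int)

print(df)
```

Because of the bfill(), the first row is never flagged, which matches the expected output in the question.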

Timings

Using the following setup to produce a larger dataframe:

df = pd.concat([df]*10**5, ignore_index=True) 

I get the following timings:

%timeit df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)
10 loops, best of 3: 38.1 ms per loop

%timeit (df.ColumnB != df.ColumnB.shift()).astype(int)
10 loops, best of 3: 77.7 ms per loop

%timeit df['ColumnB'] == df['ColumnB'].shift(1).fillna(df['ColumnB'])
10 loops, best of 3: 99.6 ms per loop

%timeit (df.ColumnB.ne(df.ColumnB.shift())).astype(int)
10 loops, best of 3: 19.3 ms per loop
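
Outside IPython, a comparison like this could be reproduced with the standard timeit module. The following is only a sketch: the candidates dictionary and its labels are made up for illustration, the frame is smaller than the one used above, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})
# Assumption: 10**4 repeats instead of the 10**5 used above, to keep the run short.
df = pd.concat([base] * 10**4, ignore_index=True)

# Hypothetical labels mapped to the expressions being compared.
candidates = {
    "ne + bfill": lambda: df["ColumnB"].ne(df["ColumnB"].shift().bfill()).astype(int),
    "!= shift": lambda: (df["ColumnB"] != df["ColumnB"].shift()).astype(int),
    "ne shift": lambda: df["ColumnB"].ne(df["ColumnB"].shift()).astype(int),
}

for label, func in candidates.items():
    # Best of 3 runs of 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{label:>12}: {best * 1000:.2f} ms per loop")
```

Note that the variants without bfill() flag the very first row as changed, so they differ from the bfill() variant in that one position.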

Solution 2

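Use .shift and compare each value with the previous one; the fillna makes the first row compare equal to itself, so it is never flagged:

dataframe['changed'] = (dataframe['ColumnB'] != dataframe['ColumnB'].shift(1).fillna(dataframe['ColumnB'])).astype(int)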

Solution 3

Comparing with the shifted column works for me. The first row is then set to 0, because it has no previous value to compare with:

df['diff'] = (df.ColumnB != df.ColumnB.shift()).astype(int)
df.loc[df.index[0], 'diff'] = 0
print(df)
   ColumnA ColumnB  diff
0        1    Blue     0
1        2    Blue     0
2        3     Red     1
3        4     Red     0
4        5  Yellow     1
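
The reset of the first row is needed because shift() leaves a missing value in row 0, and a missing value never compares equal to a string. A minimal sketch of that effect, using the sample data from the question (the names raw and fixed are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})

# Row 0 is compared with the missing value that shift() introduces,
# so it is flagged as a change even though nothing precedes it.
raw = df["ColumnB"].ne(df["ColumnB"].shift()).astype(int)

# Reset the first flag by position, which works for any index labels.
fixed = raw.copy()
fixed.iloc[0] = 0

print(raw.tolist())
print(fixed.tolist())
```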

Edit: according to the timings in another answer, using ne is fastest:

df['diff'] = (df.ColumnB.ne(df.ColumnB.shift())).astype(int)
df.loc[df.index[0], 'diff'] = 0
Author by

guilhermecgs

Updated on June 21, 2022

Comments

  • guilhermecgs
    guilhermecgs almost 2 years

    How can I flag a row in a dataframe every time a column changes its string value?

    Ex:

    Input

    ColumnA   ColumnB
    1            Blue
    2            Blue
    3            Red
    4            Red
    5            Yellow
    
    
    # diff() won't work here with strings; it only works on numerical values
    dataframe['changed'] = dataframe['ColumnB'].diff()        
    
    
    ColumnA   ColumnB      changed
    1            Blue         0
    2            Blue         0
    3            Red          1
    4            Red          0
    5            Yellow       1
    
  • guilhermecgs
    guilhermecgs over 7 years
    very clean answer
  • juanpa.arrivillaga
    juanpa.arrivillaga over 7 years
    I wonder, is there a performance difference between this approach and simply using !=?
  • jezrael
    jezrael over 7 years
    Please can you add timings for (df.ColumnB.ne(df.ColumnB.shift())).astype(int) ?
  • root
    root over 7 years
    @jezrael: Added the timing. Using ix to make the first row 0 adds ~1 ms to the timing, so it looks to be fastest that way.
  • user466130
    user466130 over 5 years
    Hi, I am using this answer in my script, but it gives me a 'SettingWithCopyWarning'. Do you see that too? dff['changed'] = dff.col1.ne(dff.col1.shift(1))
  • Santhosh Dhaipule Chandrakanth
    Santhosh Dhaipule Chandrakanth over 4 years
    @root How do I get the sequence of state changes, that is Blue -> Red, Red -> Yellow, in the same order as they were detected?
  • Santhosh Dhaipule Chandrakanth
    Santhosh Dhaipule Chandrakanth over 4 years
    @root Can I directly detect the change in state from Blue to Yellow, in spite of having Red in the middle?
  • User7777
    User7777 over 4 years
    @jezrael How can I do the same thing based on two columns?
  • jezrael
    jezrael over 4 years
    @Navroop - do you think df[['ColumnA','ColumnB']].ne(df[['ColumnA','ColumnB']].shift()).any(axis=1).astype(int) ?
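
Several of the follow-up questions in the comments can be sketched on the sample data from the question. This is only an illustration under stated assumptions: the names cols, changed_any, prev, is_change and transitions are made up here, and the ColumnA > 1 filter is a hypothetical stand-in for however dff was produced. One common cause of SettingWithCopyWarning is assigning a column to a frame that was itself sliced from another frame, and an explicit .copy() usually avoids it.

```python
import pandas as pd

df = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})

# Two columns: a row counts as changed if either column differs
# from the row above it.
cols = ["ColumnA", "ColumnB"]
changed_any = df[cols].ne(df[cols].shift()).any(axis=1).astype(int)
changed_any.iloc[0] = 0  # the first row has nothing to compare with

# Transitions: pair each changed value with the value before it,
# skipping the first row, whose "previous" value is missing.
prev = df["ColumnB"].shift()
is_change = df["ColumnB"].ne(prev) & prev.notna()
transitions = list(zip(prev[is_change], df["ColumnB"][is_change]))
print(transitions)  # [('Blue', 'Red'), ('Red', 'Yellow')]

# SettingWithCopyWarning: take an explicit copy of the slice
# (hypothetical filter) before adding the new column to it.
dff = df[df["ColumnA"] > 1].copy()
dff["changed"] = dff["ColumnB"].ne(dff["ColumnB"].shift()).astype(int)
```

In this sample ColumnA changes on every row, so the two-column flag is 1 everywhere except the first row.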