Pandas "diff()" with string
Solution 1
I get better performance with ne instead of using the actual != comparison:
df['changed'] = df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)
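To see what the bfill step contributes, here is a minimal sketch on the five-row frame from the question (the frame is rebuilt here by hand, which is an assumption about the original data types). shift() leaves NaN in the first row, so without bfill the first row is flagged as a change.

```python
import pandas as pd

# Sample frame reconstructed from the question.
df = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})

previous = df["ColumnB"].shift()       # first row becomes NaN
previous_filled = previous.bfill()     # first row takes the next value, "Blue"

# Without bfill the first row is compared against NaN and gets flagged.
flag_without_bfill = df["ColumnB"].ne(previous).astype(int)

# With bfill the first row is compared against its own value and stays 0.
df["changed"] = df["ColumnB"].ne(previous_filled).astype(int)

print(flag_without_bfill.tolist())  # [1, 0, 1, 0, 1]
print(df["changed"].tolist())       # [0, 0, 1, 0, 1]
```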
Timings
Using the following setup to produce a larger dataframe:
df = pd.concat([df]*10**5, ignore_index=True)
I get the following timings:
%timeit df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)
10 loops, best of 3: 38.1 ms per loop
%timeit (df.ColumnB != df.ColumnB.shift()).astype(int)
10 loops, best of 3: 77.7 ms per loop
%timeit df['ColumnB'] == df['ColumnB'].shift(1).fillna(df['ColumnB'])
10 loops, best of 3: 99.6 ms per loop
%timeit (df.ColumnB.ne(df.ColumnB.shift())).astype(int)
10 loops, best of 3: 19.3 ms per loop
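Outside IPython, a comparison along these lines can be sketched with the standard timeit module. The repeat factor below is smaller than the 10**5 used above and the labels are made up for illustration; the numbers it prints depend on the machine and the pandas version, so they will not match the timings listed here.

```python
import timeit

import pandas as pd

base = pd.DataFrame({"ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"]})
# Smaller repeat factor than in the answer, to keep the run short.
df = pd.concat([base] * 10**4, ignore_index=True)

# Hypothetical labels mapped to the expressions being compared.
candidates = {
    "ne + bfill": lambda: df["ColumnB"].ne(df["ColumnB"].shift().bfill()).astype(int),
    "!=": lambda: (df["ColumnB"] != df["ColumnB"].shift()).astype(int),
    "ne": lambda: df["ColumnB"].ne(df["ColumnB"].shift()).astype(int),
}

for name, func in candidates.items():
    # Best of 3 repeats, 10 calls each, reported per call.
    seconds = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```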
Solution 2
Use .shift and compare:
dataframe['changed'] = (dataframe['ColumnB'] != dataframe['ColumnB'].shift(1).fillna(dataframe['ColumnB'])).astype(int)
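A step-by-step sketch of this approach, assuming the five-row sample from the question. The intermediate names (shifted, previous, unchanged) are illustrative only. Note that an == comparison is True where the value stayed the same, so it has to be negated to obtain a 0/1 change flag.

```python
import pandas as pd

dataframe = pd.DataFrame({"ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"]})

shifted = dataframe["ColumnB"].shift(1)           # NaN, Blue, Blue, Red, Red
# fillna with a Series fills each missing entry from the same index label,
# so the first row ends up compared against its own value.
previous = shifted.fillna(dataframe["ColumnB"])   # Blue, Blue, Blue, Red, Red

unchanged = dataframe["ColumnB"] == previous      # True where the value stayed the same
dataframe["changed"] = (~unchanged).astype(int)   # 1 where the value changed

print(dataframe["changed"].tolist())  # [0, 0, 1, 0, 1]
```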
Solution 3
Comparing with shift works for me; the first row is then set to 0, because there is no previous value to compare it with:
df['diff'] = (df.ColumnB != df.ColumnB.shift()).astype(int)
df.loc[0, 'diff'] = 0
print(df)
   ColumnA ColumnB  diff
0        1    Blue     0
1        2    Blue     0
2        3     Red     1
3        4     Red     0
4        5  Yellow     1
Edit: according to the timings in another answer, using ne is fastest:
df['diff'] = (df.ColumnB.ne(df.ColumnB.shift())).astype(int)
df.loc[0, 'diff'] = 0
Author by
guilhermecgs
Updated on June 21, 2022
Comments
-
guilhermecgs almost 2 years
How can I flag a row in a dataframe every time a column changes its string value?
Ex:
Input
ColumnA ColumnB
1       Blue
2       Blue
3       Red
4       Red
5       Yellow

# diff won't work here with strings.... only works in numerical values
dataframe['changed'] = dataframe['ColumnB'].diff()

ColumnA ColumnB changed
1       Blue    0
2       Blue    0
3       Red     1
4       Red     0
5       Yellow  1
-
guilhermecgs over 7 years
very clean answer
-
juanpa.arrivillaga over 7 years
I wonder, is there a performance difference between this approach and simply using != ?
-
jezrael over 7 years
Please can you add timings for (df.ColumnB.ne(df.ColumnB.shift())).astype(int)?
-
root over 7 years
@jezrael: Added the timing. Using ix to make the first row 0 adds ~1 ms to the timing, so it looks to be fastest that way.
-
user466130 over 5 years
Hi, I am using this answer in my script, but it gave me a 'SettingWithCopyWarning'. Do you see that as well? dff['changed'] = dff.col1.ne(dff.col1.shift(1))
-
Santhosh Dhaipule Chandrakanth over 4 years
@root How do I get the state changes themselves, that is Blue -> Red, Red -> Yellow, in the same sequence as they were detected?
-
Santhosh Dhaipule Chandrakanth over 4 years
@root Can I directly find the change in state from Blue to Yellow in spite of having Red in the middle?
-
User7777 over 4 years
@jezrael How can I do the same thing based on two columns?
-
jezrael over 4 years
@Navroop - do you mean df[['ColumnA','ColumnB']].ne(df[['ColumnA','ColumnB']].shift()).any(axis=1).astype(int)?
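A small sketch of the two-column comparison suggested in the last comment. The sample values are invented for illustration and are not from the question. As with the single-column version, the first row is compared against NaN and so comes out flagged; resetting it to 0 is an optional extra step assumed here.

```python
import pandas as pd

# Hypothetical data: ColumnA changes at row 2, ColumnB changes at row 3.
df = pd.DataFrame({
    "ColumnA": [1, 1, 2, 2, 2],
    "ColumnB": ["Blue", "Blue", "Blue", "Red", "Red"],
})

cols = ["ColumnA", "ColumnB"]
# A row is flagged when at least one of the two columns differs from the row above.
changed = df[cols].ne(df[cols].shift()).any(axis=1).astype(int)
# The first row has no predecessor, so it is flagged by default; reset it.
changed.iloc[0] = 0
df["changed"] = changed

print(df["changed"].tolist())  # [0, 0, 1, 1, 0]
```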