Compare two columns using pandas
Solution 1
You could use np.where. If cond is a boolean array, and A and B are arrays, then
C = np.where(cond, A, B)
defines C to be equal to A where cond is True, and B where cond is False.
import numpy as np
import pandas as pd
a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df['que'] = np.where((df['one'] >= df['two']) & (df['one'] <= df['three']),
                     df['one'], np.nan)
yields
  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03  NaN
2   8    5     0  NaN
If you have more than one condition, then you could use np.select instead. For example, if you wish df['que'] to equal df['two'] when df['one'] < df['two'], then
conditions = [
(df['one'] >= df['two']) & (df['one'] <= df['three']),
df['one'] < df['two']]
choices = [df['one'], df['two']]
df['que'] = np.select(conditions, choices, default=np.nan)
yields
  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03   70
2   8    5     0  NaN
If we can assume that df['one'] >= df['two'] holds whenever df['one'] < df['two'] is False, then the conditions and choices could be simplified to
conditions = [
df['one'] < df['two'],
df['one'] <= df['three']]
choices = [df['two'], df['one']]
(The assumption may not be true if df['one'] or df['two'] contains NaNs.)
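To make the caveat about NaNs concrete, here is a minimal sketch. The frame df_nan and its values are hypothetical and exist only for illustration. For a row with a missing value, both one >= two and one < two evaluate to False, so the simplification above would not be safe for that row.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with a missing value in column 'two'
df_nan = pd.DataFrame({'one': [10.0, 15.0, 8.0],
                       'two': [1.2, np.nan, 5.0]})

ge = df_nan['one'] >= df_nan['two']
lt = df_nan['one'] < df_nan['two']

# Rows where neither comparison holds: only the row containing NaN
neither = ~ge & ~lt
print(neither.tolist())  # [False, True, False]
```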
Note that
a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
defines a DataFrame with string values. Since they look numeric, you might be better off converting those strings to floats:
df2 = df.astype(float)
This changes the results, however, since strings compare character-by-character, while floats are compared numerically.
In [61]: '10' <= '4.2'
Out[61]: True
In [62]: 10 <= 4.2
Out[62]: False
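As a sketch of how the conversion changes the outcome, the same np.where call can be run on the float version of the example data. The variable name cond is introduced here only for readability. With numeric comparison no row satisfies both conditions, so every entry of que ends up as NaN.

```python
import numpy as np
import pandas as pd

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

# Convert the numeric-looking strings to floats before comparing
df2 = df.astype(float)

cond = (df2['one'] >= df2['two']) & (df2['one'] <= df2['three'])
df2['que'] = np.where(cond, df2['one'], np.nan)

print(cond.tolist())             # [False, False, False]
print(df2['que'].isna().all())   # True
```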
Solution 2
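You can use .equals for columns or entire DataFrames.
df['col1'].equals(df['col2'])
If they're equal, that statement returns True; otherwise it returns False. Note that this gives a single boolean for the whole column, not an element-wise comparison.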
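The difference between .equals and an element-wise comparison can be sketched as follows. The frame df_eq and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the two columns differ only in the last row
df_eq = pd.DataFrame({'col1': [1, 2, 3], 'col2': [1, 2, 4]})

# One boolean for the whole column
whole = df_eq['col1'].equals(df_eq['col2'])

# One boolean per row
per_row = df_eq['col1'] == df_eq['col2']

print(whole)             # False
print(per_row.tolist())  # [True, True, False]
```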
Solution 3
You could use apply() and do something like this:
df['que'] = df.apply(lambda x: x['one'] if x['one'] >= x['two'] and x['one'] <= x['three'] else "", axis=1)
or, if you prefer not to use a lambda:
def que(x):
    if x['one'] >= x['two'] and x['one'] <= x['three']:
        return x['one']
    return ''
df['que'] = df.apply(que, axis=1)
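If you would rather not hard-code the column names, one possible sketch is to pass them through the args parameter of DataFrame.apply. The helper name value_if_between and its parameters are hypothetical. Note that the example data are strings, so the comparisons here are still character-by-character.

```python
import pandas as pd

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

def value_if_between(row, value_col, low_col, high_col, default=''):
    """Return row[value_col] if it lies between row[low_col] and row[high_col]."""
    if row[low_col] <= row[value_col] <= row[high_col]:
        return row[value_col]
    return default

# Extra positional arguments are forwarded to the function after the row
df['que'] = df.apply(value_if_between, axis=1, args=('one', 'two', 'three'))
print(df['que'].tolist())  # ['10', '', '']
```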
Solution 4
One way is to use a Boolean series to index the column df['one']. This gives you a new column where the True entries have the same value as the same row of df['one'] and the False entries are NaN.
The Boolean series is just given by your if statement (although it is necessary to use & instead of and):
>>> df['que'] = df['one'][(df['one'] >= df['two']) & (df['one'] <= df['three'])]
>>> df
  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03  NaN
2   8    5     0  NaN
If you want the NaN values to be replaced by other values, you can use the fillna method on the new column que. I've used 0 instead of the empty string here:
>>> df['que'] = df['que'].fillna(0)
>>> df
  one  two three que
0  10  1.2   4.2  10
1  15   70  0.03   0
2   8    5     0   0
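A related sketch of the same masking idea uses Series.where, which keeps values where the mask is True and substitutes NaN (or a given replacement) elsewhere. The frame df_num is hypothetical numeric data, not the string data above, chosen so that the first row satisfies both conditions.

```python
import pandas as pd

# Hypothetical numeric data; row 0 satisfies both conditions
df_num = pd.DataFrame({'one': [10.0, 15.0, 8.0],
                       'two': [1.2, 70.0, 5.0],
                       'three': [14.2, 0.03, 0.0]})

mask = (df_num['one'] >= df_num['two']) & (df_num['one'] <= df_num['three'])

# NaN where the mask is False
df_num['que'] = df_num['one'].where(mask)

# Same selection, with 0 as the replacement instead of NaN
df_num['que_filled'] = df_num['one'].where(mask, 0)

print(df_num['que_filled'].tolist())  # [10.0, 0.0, 0.0]
```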
Solution 5
Wrap each individual condition in parentheses, and then use the & operator to combine the conditions:
df.loc[(df['one'] >= df['two']) & (df['one'] <= df['three']), 'que'] = df['one']
You can fill the non-matching rows by just using ~ (the "not" operator) to invert the match:
df.loc[~ ((df['one'] >= df['two']) & (df['one'] <= df['three'])), 'que'] = ''
You need to use & and ~ rather than and and not because the & and ~ operators work element-by-element.
The final result:
df
Out[8]:
  one  two three que
0  10  1.2   4.2  10
1  15   70  0.03
2   8    5     0
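The point about & and ~ versus and and not can be checked with a small sketch. The frame df_num and the threshold 12 are hypothetical. Combining two boolean Series with and makes Python ask for a single truth value of a whole Series, which pandas rejects with a ValueError.

```python
import pandas as pd

# Hypothetical numeric data for illustration
df_num = pd.DataFrame({'one': [10.0, 15.0], 'two': [1.2, 70.0]})

a = df_num['one'] >= df_num['two']
b = df_num['one'] <= 12

try:
    a and b          # needs bool(a), which is ambiguous for a Series
    raised = False
except ValueError:
    raised = True

combined = a & b     # element-wise "and"
inverted = ~combined # element-wise "not"

print(raised)             # True
print(combined.tolist())  # [True, False]
print(inverted.tolist())  # [False, True]
```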
Comments
-
Merlin almost 2 years
Using this as a starting point:
a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
Out[8]:
  one  two three
0  10  1.2   4.2
1  15   70  0.03
2   8    5     0
I want to use something like an if statement within pandas.
if df['one'] >= df['two'] and df['one'] <= df['three']:
    df['que'] = df['one']
Basically, check each row via the if statement and create a new column.
The docs say to use .all but there is no example...
-
Alex Riley over 9 years
What should the value be if the if statement is False?
-
unutbu over 9 years
@Merlin: If you have numeric data in a column, it is best not to mix it with strings. Doing so changes the column's dtype to object. This allows arbitrary Python objects to be stored in the column, but it comes at the cost of slower numeric computation. Thus if the column is storing numeric data, using NaNs for not-a-numbers is preferable.
-
Primer over 9 years
Having integers as strings and trying to do comparison on them looks odd:
a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
This creates confusing results with "correct" code:
df['que'] = df['one'][(df['one'] >= df['two']) & (df['one'] <= df['three'])]
yields 10 for the first line, while it should yield NaN if the input had been integers.
-
Marius over 9 years
I suspect this is probably a bit slower than the other approaches posted, since it doesn't take advantage of the vectorized operations that pandas allows.
-
Merlin over 9 years
@BobHaffner: lambdas are not readable when using complex if/then/else statements.
-
Bob Haffner over 9 years
@Merlin you could add an elif, and I would agree with you on lambdas and multiple conditions.
-
AZhao almost 9 years
Is there a way to generalize the non-lambda function such that you can pass DataFrame columns in, and not change the name?
-
Bob Haffner almost 9 years
@AZhao you could generalize with iloc like this:
df['que'] = df.apply(lambda x: x.iloc[0] if x.iloc[0] >= x.iloc[1] and x.iloc[0] <= x.iloc[2] else "", axis=1)
Is that what you mean? Obviously, the order of your columns matters.
-
guerda over 5 years
Note: this only compares the whole column to another one. It does not compare the columns element-wise.
-
R. Cox about 5 years
This method is the only one that worked for me! stackoverflow.com/questions/54476753/…
-
vasili111 about 4 years
Your code gives me an error:
df['que'] = (df['one'] if ((df['one'] >= df['two']) and (df['one'] <= df['three']))
^
SyntaxError: unexpected EOF while parsing
-
rrlamichhane about 4 years
How about if you want to see if one column always has a value "greater than" or "less than" the other columns?
-
ericzheng0404 almost 2 years
It is not comparing whether individual rows of two columns are equal. It is comparing the whole column. If any of the rows is not equal to the other column, the whole result is False.
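Regarding the questions in the comments about .all and about checking whether one column is always greater than another, one possible sketch is to reduce an element-wise comparison with .all() or .any(). The frame df_num is a hypothetical float version of the example data.

```python
import pandas as pd

# Hypothetical float version of the example data
df_num = pd.DataFrame({'one': [10.0, 15.0, 8.0],
                       'two': [1.2, 70.0, 5.0],
                       'three': [4.2, 0.03, 0.0]})

# True only if the comparison holds in every row
always_greater = (df_num['one'] > df_num['three']).all()

# True if the comparison holds in at least one row
ever_less = (df_num['one'] < df_num['two']).any()

print(bool(always_greater))  # True
print(bool(ever_less))       # True
```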