pandas: multiple conditions while indexing data frame - unexpected behavior
Solution 1
As you can see, the AND operator drops every row in which at least one value equals -1. On the other hand, the OR operator requires both values to be equal to -1 to drop them.
That's right. Remember that you're writing the condition in terms of what you want to keep, not in terms of what you want to drop. For df1
:
df1 = df[(df.a != -1) & (df.b != -1)]
You're saying "keep the rows in which df.a
isn't -1 and df.b
isn't -1", which is the same as dropping every row in which at least one value is -1.
For df2
:
df2 = df[(df.a != -1) | (df.b != -1)]
You're saying "keep the rows in which either df.a
or df.b
is not -1", which is the same as dropping rows where both values are -1.
PS: chained access like df['a'][1] = -1
can get you into trouble. It's better to get into the habit of using .loc
and .iloc
.
Solution 2
You can also use query(), i.e.:
df_filtered = df.query('a == 4 & b != 2')
Solution 3
A little mathematical logic theory here:
"NOT a AND NOT b" is the same as "NOT (a OR b)", so:
"a NOT -1 AND b NOT -1" is equivalent of "NOT (a is -1 OR b is -1)", which is opposite (Complement) of "(a is -1 OR b is -1)".
So if you want exact opposite result, df1 and df2 should be as below:
df1 = df[(df.a != -1) & (df.b != -1)]
df2 = df[(df.a == -1) | (df.b == -1)]
Related videos on Youtube
Wojciech Walczak
Updated on April 28, 2022Comments
-
Wojciech Walczak about 2 years
I am filtering rows in a dataframe by values in two columns.
For some reason the OR operator behaves like I would expect AND operator to behave and vice versa.
My test code:
import pandas as pd df = pd.DataFrame({'a': range(5), 'b': range(5) }) # let's insert some -1 values df['a'][1] = -1 df['b'][1] = -1 df['a'][3] = -1 df['b'][4] = -1 df1 = df[(df.a != -1) & (df.b != -1)] df2 = df[(df.a != -1) | (df.b != -1)] print pd.concat([df, df1, df2], axis=1, keys = [ 'original df', 'using AND (&)', 'using OR (|)',])
And the result:
original df using AND (&) using OR (|) a b a b a b 0 0 0 0 0 0 0 1 -1 -1 NaN NaN NaN NaN 2 2 2 2 2 2 2 3 -1 3 NaN NaN -1 3 4 4 -1 NaN NaN 4 -1 [5 rows x 6 columns]
As you can see, the
AND
operator drops every row in which at least one value equals-1
. On the other hand, theOR
operator requires both values to be equal to-1
to drop them. I would expect exactly the opposite result. Could anyone explain this behavior, please?I am using pandas 0.13.1.
-
cs95 over 5 years
df.query
andpd.eval
seem like good fits for this use case. For information on thepd.eval()
family of functions, their features and use cases, please visit Dynamic Expression Evaluation in pandas using pd.eval().
-
-
Phillip Cloud about 10 years
DataFrame.query()
works nicely here too.df.query('a != -1 or b != -1')
. -
stoves about 7 yearsHappen to know why pandas wants
&
and|
overand
andor
? -
DSM about 7 years@stoves: in normal Python code,
and
andor
have basic Python semantics that can't be modified.&
and|
, on the other hand, have corresponding special methods which control their behaviour. (In query strings, of course, we're free to apply any parsing we like.) -
3pitt about 6 yearsinterestingly, it seems like
df[True & False]
fails butdf[(True) & (False)]
succeeds (not tested on this example) -
tommy.carstensen over 5 yearsWould it be possible to break this kind of syntax across multiple lines? What would be most PEP8?
-
Aus_10 about 4 yearsI have a situation where I think this syntax makes more sense e.g.: df.query(''(a==4 & b!=2) | c== 3")
-
SeaBean almost 3 yearsAs
&
has a higher operator precedence than==
, is it special for.query()
not to interpret the boolean as something likea == (4 & b)
?