pandas table subsets giving invalid type comparison error
Solution 1
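I think you need to add parentheses () around each condition. The & operator binds more tightly than ==, so without them Python evaluates the expression as (df['A'].notnull() & df['B'].isnull() & df['C']) == 'yes', and that mis-grouped expression is what fails. It is also better to use loc (ix was deprecated and then removed in pandas 1.0) for selecting a column with a boolean mask, which can be assigned to a variable mask:
mask = (df['A'].notnull()) & (df['B'].isnull()) & (df['C']=='yes')
print (mask)
0 True
1 True
2 False
3 False
dtype: bool
df.loc[mask, 'D'] = df.loc[mask, 'A']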
print (df)
A B C D
0 -0.681771 NaN yes -0.681771
1 -0.871787 NaN yes -0.871787
2 -0.805301 NaN no
3 1.264103 NaN maybe
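For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The sample values are made up for illustration, and it assumes a recent pandas, so it uses .loc. One deliberate difference from the question: D is initialised with NaN rather than an empty string so the column stays numeric.

```python
import numpy as np
import pandas as pd

# Illustrative data only; the values are not taken from the question.
df = pd.DataFrame({
    "A": [0.5, -0.5, 0.8, np.nan],
    "B": [np.nan, np.nan, np.nan, np.nan],
    "C": ["yes", "yes", "no", "yes"],
})
df["D"] = np.nan  # placeholder column (assumption: NaN instead of '')

# Parenthesise the == comparison: & binds more tightly than ==,
# so without the parentheses the conditions are grouped incorrectly.
mask = df["A"].notnull() & df["B"].isnull() & (df["C"] == "yes")

# Copy A into D only on the rows where all three conditions hold.
df.loc[mask, "D"] = df.loc[mask, "A"]
print(df)
```

Rows where A is missing or C is not 'yes' keep the NaN placeholder in D.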
Solution 2
In case this solution doesn't work for anyone, here is another situation that happened to me. Even though I was reading all data in as dtype=str (so any string comparison such as df[col] == "some string" should be OK), I had a column of all nulls, which becomes type float and gives an error when compared to a string.
To get around that, you can use .astype(str) to ensure a string-to-string comparison is performed.
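A rough sketch of that workaround, with a made-up column name col. Whether the un-cast comparison actually raises depends on the pandas version (newer releases may simply return all False), so treat this as an illustration of the cast rather than a reproduction of the error.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a column that contains only nulls.
# pandas stores such a column as float64.
df = pd.DataFrame({"col": [np.nan, np.nan, np.nan]})
print(df["col"].dtype)

# Cast to str before comparing so both sides of == are strings.
mask = df["col"].astype(str) == "some string"
print(mask)
```

No row matches, so the mask contains no True values.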
dreab
Updated on August 03, 2022
Comments
-
dreab almost 2 years
I am using pandas and want to select subsets of data and apply it to other columns. e.g.
- if there is data in column A; &
- if there is NO data in column B;
- then, apply the data in column A to column D
I have this working fine for now using .isnull() and .notnull(), e.g.
df = pd.DataFrame({'A' : pd.Series(np.random.randn(4)),
                   'B' : pd.Series(np.nan),
                   'C' : pd.Series(['yes','yes','no','maybe'])})
df['D']=''
df
Out[44]:
          A   B      C D
0  0.516752 NaN    yes
1 -0.513194 NaN    yes
2  0.861617 NaN     no
3 -0.026287 NaN  maybe
# Now try the first conditional expression
df['D'][df['A'].notnull() & df['B'].isnull()] \
    = df['A'][df['A'].notnull() & df['B'].isnull()]
df
Out[46]:
          A   B      C          D
0  0.516752 NaN    yes   0.516752
1 -0.513194 NaN    yes  -0.513194
2  0.861617 NaN     no   0.861617
3 -0.026287 NaN  maybe -0.0262874
When one adds a third condition, to also check whether data in column C matches a particular string, we get the error:
df['D'][df['A'].notnull() & df['B'].isnull() & df['C']=='yes'] \
    = df['A'][df['A'].notnull() & df['B'].isnull() & df['C']=='yes']
  File "C:\Anaconda2\Lib\site-packages\pandas\core\ops.py", line 763, in wrapper
    res = na_op(values, other)
  File "C:\Anaconda2\Lib\site-packages\pandas\core\ops.py", line 718, in na_op
    raise TypeError("invalid type comparison")
TypeError: invalid type comparison
I have read that this occurs due to the different datatypes, and I can get it working if I change all the strings in column C to integers or booleans. We also know that a string comparison on its own works, e.g. df['A'][df['B']=='yes'] gives a boolean list.
So any ideas how/why this is not working when combining these datatypes in this conditional expression? What are the more pythonic ways to do what appears to be quite long-winded?
Thanks