pandas table subsets giving invalid type comparison error


Solution 1

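You need to wrap each condition in parentheses (), because & binds more tightly than ==. Without them, Python evaluates (... & df['C']) == 'yes', and the & between a boolean Series and a column of strings raises the error. It is also better to build the boolean mask once, assign it to a variable mask, and select the column with loc (ix, used in the original version of this answer, was deprecated in pandas 0.20 and removed in 1.0):

mask = (df['A'].notnull()) & (df['B'].isnull()) & (df['C']=='yes')
print (mask)
0     True
1     True
2    False
3    False
dtype: bool

df.loc[mask, 'D'] = df.loc[mask, 'A']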

print (df)
          A   B      C         D
0 -0.681771 NaN    yes -0.681771
1 -0.871787 NaN    yes -0.871787
2 -0.805301 NaN     no          
3  1.264103 NaN  maybe   
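
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It is an illustration under some assumptions, not the exact code from the question: column A uses fixed values instead of np.random.randn so the result is reproducible, the last row of A is NaN to exercise the notnull() check, and D starts as NaN rather than '' so the column stays numeric.

```python
import numpy as np
import pandas as pd

# Fixed values (an assumption for reproducibility; the question uses random data).
df = pd.DataFrame({
    'A': [0.5, -0.5, 0.8, np.nan],
    'B': [np.nan, np.nan, np.nan, np.nan],
    'C': ['yes', 'yes', 'no', 'yes'],
})
df['D'] = np.nan  # assumption: NaN placeholder instead of '' to keep D numeric

# The comparison is parenthesized because & binds more tightly than ==.
# Without the parentheses Python would evaluate (... & df['C']) == 'yes'.
mask = df['A'].notnull() & df['B'].isnull() & (df['C'] == 'yes')

# One .loc call per side avoids chained indexing such as df['D'][mask] = ...
df.loc[mask, 'D'] = df.loc[mask, 'A']

print(mask.tolist())
print(df)
```

Only the rows where all three conditions hold get a value in D; the other rows keep their placeholder.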

Solution 2

In case the solution above doesn't work for you, here is another situation that happened to me: even though I was reading all data in as dtype=str (so any string comparison, e.g. df[col] == "some string", should have been OK), I had a column consisting entirely of nulls. That column ended up as type float, which gives an error when compared to a string.

To get around that, you can call .astype(str) on the column to ensure a string-to-string comparison is performed.
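
A minimal sketch of that workaround, assuming a small hand-built frame (the column names here are only for illustration). Note that recent pandas versions may no longer raise on the plain comparison, and that casting can turn missing values into the literal string 'nan', so the cast is best applied only inside the comparison rather than written back to the column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column B is entirely null, so pandas stores it as float64.
df = pd.DataFrame({
    'B': [np.nan, np.nan, np.nan],
    'C': ['yes', 'no', 'yes'],
})
print(df['B'].dtype)

# Cast only for the comparison so both sides are strings.
mask = (df['B'].astype(str) == 'yes') | (df['C'].astype(str) == 'yes')
print(mask.tolist())
```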

Author by

dreab

Updated on August 03, 2022

Comments

  • dreab
    dreab almost 2 years

    I am using pandas and want to select subsets of data and apply it to other columns. e.g.

    • if there is data in column A; &
    • if there is NO data in column B;
    • then, apply the data in column A to column D

    I have this working fine for now using .isnull() and .notnull(). e.g.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A' : pd.Series(np.random.randn(4)),
                       'B' : pd.Series(np.nan),
                       'C' : pd.Series(['yes','yes','no','maybe'])})
    df['D']=''
    
    df
    Out[44]: 
              A   B      C D
    0  0.516752 NaN    yes  
    1 -0.513194 NaN    yes  
    2  0.861617 NaN     no  
    3 -0.026287 NaN  maybe  
    
    # Now try the first conditional expression
    df['D'][df['A'].notnull() & df['B'].isnull()] \
    =  df['A'][df['A'].notnull() & df['B'].isnull()]   
    df
    Out[46]: 
              A   B      C          D
    0  0.516752 NaN    yes   0.516752
    1 -0.513194 NaN    yes  -0.513194
    2  0.861617 NaN     no   0.861617
    3 -0.026287 NaN  maybe -0.0262874
    

    When one adds a third condition, to also check whether data in column C matches a particular string, we get the error:

    df['D'][df['A'].notnull() & df['B'].isnull() & df['C']=='yes'] \
    =  df['A'][df['A'].notnull() & df['B'].isnull() & df['C']=='yes']   
    
    
      File "C:\Anaconda2\Lib\site-packages\pandas\core\ops.py", line 763, in wrapper
        res = na_op(values, other)
    
      File "C:\Anaconda2\Lib\site-packages\pandas\core\ops.py", line 718, in na_op
        raise TypeError("invalid type comparison")
    
    TypeError: invalid type comparison
    

    I have read that this occurs due to the different datatypes. And I can get it working if I change all the strings in column C for integers or booleans. We also know that string on its own would work, e.g. df['A'][df['B']=='yes'] gives a boolean list.

    So any ideas how/why this is not working when combining these datatypes in this conditional expression? What are the more pythonic ways to do what appears to be quite long-winded?

    Thanks