Python pandas: how to remove nan and -inf values


Solution 1

Use pd.DataFrame.isin to flag cells that are NaN, inf or -inf, then use pd.DataFrame.any along axis=1 to find the rows that contain at least one such cell. Finally, negate that boolean Series and use it to slice the dataframe.

df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]

             time    X    Y  X_t0     X_tp0   X_t1     X_tp1   X_t2     X_tp2
4        0.037389    3   10     3  0.333333    2.0  0.500000    1.0  1.000000
5        0.037393    4   10     4  0.250000    3.0  0.333333    2.0  0.500000
1030308  9.962213  256  268   256  0.000000  256.0  0.003906  255.0  0.003922
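
The same idea on a small frame, as a minimal sketch. The frame toy and its values are made up for illustration and are not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one NaN and one -inf
toy = pd.DataFrame({
    "X": [0.0, 1.0, 2.0, 3.0],
    "X_tp0": [np.nan, 1.0, -np.inf, 0.5],
})

# True where a cell is NaN, inf or -inf
bad_cells = toy.isin([np.nan, np.inf, -np.inf])

# Keep only the rows that have no flagged cell
clean = toy[~bad_cells.any(axis=1)]
print(clean)
```

Only the rows with index 1 and 3 should remain, since rows 0 and 2 each hold a flagged value.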

Solution 2

You can replace inf and -inf with NaN, and then select non-null rows.

df[df.replace([np.inf, -np.inf], np.nan).notnull().all(axis=1)]  # .astype(np.float64) ?

or

df.replace([np.inf, -np.inf], np.nan).dropna(axis=0)  # axis=0 drops rows; axis=1 would drop columns

Check the dtypes of your columns via df.info() to make sure they are all as expected (e.g. np.float32/64).
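
The axis argument of dropna decides whether rows or columns are removed, which is easy to mix up. A small sketch on made-up data (the frame toy is hypothetical) shows the difference, along with a quick dtype check:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "a" holds one inf
toy = pd.DataFrame({
    "a": [1.0, np.inf, 3.0],
    "b": [4.0, 5.0, 6.0],
})
as_nan = toy.replace([np.inf, -np.inf], np.nan)

rows_kept = as_nan.dropna(axis=0)  # drops row 1, keeps both columns
cols_kept = as_nan.dropna(axis=1)  # drops column "a", keeps all rows

print(rows_kept.shape)  # (2, 2)
print(cols_kept.shape)  # (3, 1)
print(toy.dtypes)       # both columns should be float64
```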

Solution 3

df = df.replace([np.inf, -np.inf], np.nan)  # replace returns a new frame, so assign it back

df.dropna(inplace=True)

Solution 4

Instead of dropping rows that contain any nulls or infinite numbers, it is more succinct to reverse the logic and return the rows where all cells are finite numbers. The numpy isfinite function tests each cell, and '.all(axis=1)' returns True only if every cell in the row is finite. This requires all columns to be numeric.

df = df[np.isfinite(df).all(axis=1)]
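
np.isfinite generally fails on non-numeric columns such as strings. If the frame has any, one option is to build the mask from the numeric columns only. This is a sketch on made-up data, and the select_dtypes step is an addition that is not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a string column next to numeric ones
toy = pd.DataFrame({
    "label": ["a", "b", "c"],
    "value": [1.0, np.nan, -np.inf],
    "count": [1, 2, 3],
})

# Build the finiteness mask from numeric columns only
numeric = toy.select_dtypes(include=[np.number])
mask = np.isfinite(numeric).all(axis=1)

# Apply the mask to the full frame, so the string column is kept
clean = toy[mask]
print(clean)
```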

Solution 5

I prefer to set the option that makes pandas treat inf values as NaN (note that mode.use_inf_as_na is deprecated as of pandas 2.1, so this applies to older versions):

s1 = pd.Series([0, 1, 2])
s2 = pd.Series([2, 1, 0])
s1/s2
# Outputs:
# 0.0
# 1.0
# inf
# dtype: float64

pd.set_option('mode.use_inf_as_na', True)
s1/s2
# Outputs:
# 0.0
# 1.0
# NaN
# dtype: float64

You can also set the option temporarily with a context manager:

with pd.option_context('mode.use_inf_as_na', True):
    print(s1/s2)
# Outputs:
# 0.0
# 1.0
# NaN
# dtype: float64
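
If you would rather not depend on a global option (its availability may vary with your pandas version, so check your installed release), the same result can be reached explicitly. A minimal sketch reusing s1 and s2 from above:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0, 1, 2])
s2 = pd.Series([2, 1, 0])

# Convert infinities to NaN right after the division
ratio = (s1 / s2).replace([np.inf, -np.inf], np.nan)

print(ratio.isna().tolist())  # [False, False, True]
```

This keeps the conversion local to the expression instead of changing behaviour for the whole session.
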
Author by

Admin

Updated on July 23, 2022

Comments

  • Admin
    Admin almost 2 years

    I have the following dataframe

               time       X    Y  X_t0     X_tp0  X_t1     X_tp1  X_t2     X_tp2
    0         0.002876    0   10     0       NaN   NaN       NaN   NaN       NaN
    1         0.002986    0   10     0       NaN     0       NaN   NaN       NaN
    2         0.037367    1   10     1  1.000000     0       NaN     0       NaN
    3         0.037374    2   10     2  0.500000     1  1.000000     0       NaN
    4         0.037389    3   10     3  0.333333     2  0.500000     1  1.000000
    5         0.037393    4   10     4  0.250000     3  0.333333     2  0.500000
    
    ....
    1030308   9.962213  256  268   256  0.000000   256  0.003906   255  0.003922
    1030309  10.041799    0  268     0      -inf   256  0.000000   256  0.003906
    1030310  10.118960    0  268     0       NaN     0      -inf   256  0.000000
    

    I tried with the following

    df.dropna(inplace=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
    
    X_train = X_train.drop('time', axis=1)
    X_train = X_train.drop('X_t1', axis=1)
    X_train = X_train.drop('X_t2', axis=1)
    X_test = X_test.drop('time', axis=1)
    X_test = X_test.drop('X_t1', axis=1)
    X_test = X_test.drop('X_t2', axis=1)
    X_test.fillna(X_test.mean(), inplace=True)
    X_train.fillna(X_train.mean(), inplace=True)
    y_train.fillna(y_train.mean(), inplace=True)
    

    However, I still get the error ValueError: Input contains NaN, infinity or a value too large for dtype('float32'). whenever I try to fit a regression model with fit(X_train, y_train).

    How can we remove both the NaN and -inf values at the same time?
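
    A likely reason the error persists: dropna() removes NaN but leaves -inf in place, and filling with the column mean does not help once an infinite value is in the column. One way to handle both is to clean the whole frame before deriving X and y. The sketch below uses a few made-up rows and assumes, purely for illustration, that Y is the target column.

```python
import numpy as np
import pandas as pd

# Hypothetical rows loosely modelled on the frame in the question
df = pd.DataFrame({
    "time": [0.0029, 0.0374, 0.0374, 10.04],
    "X": [0, 3, 4, 0],
    "Y": [10, 10, 10, 268],
    "X_tp0": [np.nan, 0.333333, 0.25, -np.inf],
})

# Turn infinities into NaN, then drop every row that has a NaN
df = df.replace([np.inf, -np.inf], np.nan).dropna()

# Assumption: "Y" is the target; derive X and y only after cleaning
X = df.drop(columns=["time", "Y"])
y = df["Y"]

# Sanity check before fitting a model
assert np.isfinite(X.to_numpy()).all()
```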