Convert Pandas column containing NaNs to dtype `int`

python pandas dataframe dtype

308,914

Solution 1

The lack of NaN rep in integer columns is a pandas "gotcha".

The usual workaround is to simply use floats.

Solution 2

In version 0.24.+ pandas has gained the ability to hold integer dtypes with missing values.

Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension types implemented within pandas. It is not the default dtype for integers, and will not be inferred; you must explicitly pass the dtype into array() or Series:

arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)
0      1
1      2
2    NaN
dtype: Int64

For convert column to nullable integers use:

df['myCol'] = df['myCol'].astype('Int64')

Solution 3

My use case is munging data prior to loading into a DB table:

df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)

Remove NaNs, convert to int, convert to str and then reinsert NANs.

It's not pretty but it gets the job done!

Solution 4

It is now possible to create a pandas column containing NaNs as dtype int, since it is now officially added on pandas 0.24.0

pandas 0.24.x release notes Quote: "Pandas has gained the ability to hold integer dtypes with missing values

Solution 5

Whether your pandas series is object datatype or simply float datatype the below method will work

df = pd.read_csv("data.csv") 
df['id'] = df['id'].astype(float).astype('Int64')

View more solutions

308,914

Zhubarb

When you stare long into the thesys, the thesys stares back into you.

Updated on July 08, 2022

Comments

Zhubarb 12 months
I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is the id series has missing/empty values.

When I try to cast the id column to integer while reading the .csv, I get:
```
df= pd.read_csv("data.csv", dtype={'id': int}) 
error: Integer column has NA values
```
Alternatively, I tried to convert the column type after reading as below, but this time I get:
```
df= pd.read_csv("data.csv") 
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer
```
How can I tackle this?
- Alvaro Fuentes over 9 years
  
  Could you post the content of your file?
- Zhubarb over 9 years
  
  @xndrme, the file itself is too large. I will see if I can create a small test case. But essentially the situation is that the id column has many integer values and some empty/missing cells.
- EdChum over 9 years
  
  I think that integer values cannot be converted or stored in a series/dataframe if there are missing/NaN values. This I think is to do with numpy compatibility (I'm guessing here), if you want missing value compatibility then I would store the values as floats
- Jeff over 9 years
  
  see here: pandas.pydata.org/pandas-docs/dev/…; you must have a float dtype when u have missing values (or technically object dtype but that is inefficient); what is your goal of using int type?
- Jeff over 9 years
  
  FYI, if you don't specify a dtype, then pandas will infer float for the column, no conversion needed.
- ely over 9 years
  
  I believe this is a NumPy issue, not specific to Pandas. It's a shame since there are so many cases when having an int type that allows for the possibility of null values is much more efficient than a large column of floats.
- dermen almost 8 years
  
  I have a problem with this too. I have multiple dataframes which I want to merge based on a string representation of several "integer" columns. However, when one of those integer columns has a np.nan, the string casting produces a ".0", which throws off the merge. Just makes things slightly more complicated, would be nice if there was simple work-around.
- mork over 4 years
  
  @Rhubarb, Optional Nullable Integer Support is now officially added on pandas 0.24.0 - finally :) - please find an updated answer bellow. pandas 0.24.x release notes
NumenorForLife about 8 years

Are there any other workarounds besides treating them like floats?
Andy Hayden about 8 years

@jsc123 you can use the object dtype. This comes with a small health warning but for the most part works well.
MikeyE almost 7 years

Can you provide an example of how to use object dtype? I've been looking through the pandas docs and googling, and I've read it's the recommended method. But, I haven't found an example of how to use the object dtype.
Chris Decker over 4 years

I have been pulling my hair out trying to load serial numbers where some are null and the rest are floats, this saved me.
Rishab Gupta over 4 years

The OP wants a column of integers. Converting it to string does not meet the condition.
cs95 about 4 years

In v0.24, you can now do df = df.astype(pd.Int32Dtype()) (to convert the entire dataFrame, or) df['col'] = df['col'].astype(pd.Int32Dtype()). Other accepted nullable integer types are pd.Int16Dtype and pd.Int64Dtype. Pick your poison.
Winston almost 4 years

It is NaN value but isnan checking doesn't work at all :(
Viacheslav Zhukov over 3 years

Note that dtype must be "Int64" and not "int64" (first 'i' must be capitalized)
Sharvari Gc over 3 years

Works only if col doesn't already have -1. Otherwise, it will mess with the data
LoMaPh over 3 years

df.myCol = df.myCol.astype('Int64') or df['myCol'] = df['myCol'].astype('Int64')
abdoulsn over 3 years

then how to get back to int..??
Jeremy Caney about 3 years

Is there a reason you prefer this formulation over that proposed in the accepted answer? If so, it'd be useful to edit your answer to provide that explanation—and especially since there are ten additional answers that are competing for attention.
SherylHohman about 3 years

While this code may resolve the OP's issue, it is best to include an explanation as to how/why your code addresses it. In this way, future visitors can learn from your post, and apply it to their own code. SO is not a coding service, but a resource for knowledge. Also, high quality, complete answers are more likely to be upvoted. These features, along with the requirement that all posts are self-contained, are some of the strengths of SO as a platform differentiates it from forums. You can edit to add additional info &/or to supplement your explanations with source documentation.
Bradon almost 3 years

If there are NaN values in the column, pd.to_numeric will convert the dtype to float not int because NaN is considered a float.
wfgeo almost 3 years

It may be obvious to some but it I think it is still worth noting that you can use any Int (e.g. Int16, Int32) and indeed probably should if the dataframe is very large to save memory.
Newbielp over 2 years

@jezrael, in which cases this does not work...? It does not work for me and I have failed to find a generic solution.
jezrael over 2 years

@Newbielp - First idea is test if nan are missing values or strings 'nan'
Newbielp over 2 years

@jezrael, it gives me type float for specific element. Whole column is type object.
zelusp over 2 years

this approach can add a lot of memory overhead, especially on larger dataframes
BERA almost 2 years

I'm getting TypeError: cannot safely cast non-equivalent float64 to int64
PatrickT over 1 year

See stackoverflow.com/questions/58029359/…
PatrickT over 1 year

This produces a column of strings!! For a solution with current versions of pandas, see stackoverflow.com/questions/58029359/…
Henrique Brisola over 1 year

@cs95 I am getting erro object cannot be converted to an IntegerDtype
robertspierre over 1 year

@Zhubarb please consider changing your accepted answer in light of this development
buhtz over 1 year

Please explain your solution.
Jane Kathambi over 1 year

Thank you @Abhishek Bhatia this worked for me.
Jane Kathambi over 1 year

Works but I think replacing NaN with 0 changes the meaning of the data.
bsauce over 1 year

caution with this approach... if any of your data really is -1, it will be overwritten.