Convert Pandas column containing NaNs to dtype `int`

308,914

Solution 1

The lack of NaN rep in integer columns is a pandas "gotcha".

The usual workaround is to simply use floats.

Solution 2

In version 0.24.+ pandas has gained the ability to hold integer dtypes with missing values.

Nullable Integer Data Type.

Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension types implemented within pandas. It is not the default dtype for integers, and will not be inferred; you must explicitly pass the dtype into array() or Series:

arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)
0      1
1      2
2    NaN
dtype: Int64

For convert column to nullable integers use:

df['myCol'] = df['myCol'].astype('Int64')

Solution 3

My use case is munging data prior to loading into a DB table:

df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)

Remove NaNs, convert to int, convert to str and then reinsert NANs.

It's not pretty but it gets the job done!

Solution 4

It is now possible to create a pandas column containing NaNs as dtype int, since it is now officially added on pandas 0.24.0

pandas 0.24.x release notes Quote: "Pandas has gained the ability to hold integer dtypes with missing values

Solution 5

Whether your pandas series is object datatype or simply float datatype the below method will work

df = pd.read_csv("data.csv") 
df['id'] = df['id'].astype(float).astype('Int64')
Share:
308,914

Related videos on Youtube

Zhubarb
Author by

Zhubarb

When you stare long into the thesys, the thesys stares back into you.

Updated on July 08, 2022

Comments

  • Zhubarb
    Zhubarb 12 months

    I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is the id series has missing/empty values.

    When I try to cast the id column to integer while reading the .csv, I get:

    df= pd.read_csv("data.csv", dtype={'id': int}) 
    error: Integer column has NA values
    

    Alternatively, I tried to convert the column type after reading as below, but this time I get:

    df= pd.read_csv("data.csv") 
    df[['id']] = df[['id']].astype(int)
    error: Cannot convert NA to integer
    

    How can I tackle this?

    • Alvaro Fuentes
      Alvaro Fuentes over 9 years
      Could you post the content of your file?
    • Zhubarb
      Zhubarb over 9 years
      @xndrme, the file itself is too large. I will see if I can create a small test case. But essentially the situation is that the id column has many integer values and some empty/missing cells.
    • EdChum
      EdChum over 9 years
      I think that integer values cannot be converted or stored in a series/dataframe if there are missing/NaN values. This I think is to do with numpy compatibility (I'm guessing here), if you want missing value compatibility then I would store the values as floats
    • Jeff
      Jeff over 9 years
      see here: pandas.pydata.org/pandas-docs/dev/…; you must have a float dtype when u have missing values (or technically object dtype but that is inefficient); what is your goal of using int type?
    • Jeff
      Jeff over 9 years
      FYI, if you don't specify a dtype, then pandas will infer float for the column, no conversion needed.
    • ely
      ely over 9 years
      I believe this is a NumPy issue, not specific to Pandas. It's a shame since there are so many cases when having an int type that allows for the possibility of null values is much more efficient than a large column of floats.
    • dermen
      dermen almost 8 years
      I have a problem with this too. I have multiple dataframes which I want to merge based on a string representation of several "integer" columns. However, when one of those integer columns has a np.nan, the string casting produces a ".0", which throws off the merge. Just makes things slightly more complicated, would be nice if there was simple work-around.
    • mork
      mork over 4 years
      @Rhubarb, Optional Nullable Integer Support is now officially added on pandas 0.24.0 - finally :) - please find an updated answer bellow. pandas 0.24.x release notes
  • NumenorForLife
    NumenorForLife about 8 years
    Are there any other workarounds besides treating them like floats?
  • Andy Hayden
    Andy Hayden about 8 years
    @jsc123 you can use the object dtype. This comes with a small health warning but for the most part works well.
  • MikeyE
    MikeyE almost 7 years
    Can you provide an example of how to use object dtype? I've been looking through the pandas docs and googling, and I've read it's the recommended method. But, I haven't found an example of how to use the object dtype.
  • Chris Decker
    Chris Decker over 4 years
    I have been pulling my hair out trying to load serial numbers where some are null and the rest are floats, this saved me.
  • Rishab Gupta
    Rishab Gupta over 4 years
    The OP wants a column of integers. Converting it to string does not meet the condition.
  • cs95
    cs95 about 4 years
    In v0.24, you can now do df = df.astype(pd.Int32Dtype()) (to convert the entire dataFrame, or) df['col'] = df['col'].astype(pd.Int32Dtype()). Other accepted nullable integer types are pd.Int16Dtype and pd.Int64Dtype. Pick your poison.
  • Winston
    Winston almost 4 years
    It is NaN value but isnan checking doesn't work at all :(
  • Viacheslav Zhukov
    Viacheslav Zhukov over 3 years
    Note that dtype must be "Int64" and not "int64" (first 'i' must be capitalized)
  • Sharvari Gc
    Sharvari Gc over 3 years
    Works only if col doesn't already have -1. Otherwise, it will mess with the data
  • LoMaPh
    LoMaPh over 3 years
    df.myCol = df.myCol.astype('Int64') or df['myCol'] = df['myCol'].astype('Int64')
  • abdoulsn
    abdoulsn over 3 years
    then how to get back to int..??
  • Jeremy Caney
    Jeremy Caney about 3 years
    Is there a reason you prefer this formulation over that proposed in the accepted answer? If so, it'd be useful to edit your answer to provide that explanation—and especially since there are ten additional answers that are competing for attention.
  • SherylHohman
    SherylHohman about 3 years
    While this code may resolve the OP's issue, it is best to include an explanation as to how/why your code addresses it. In this way, future visitors can learn from your post, and apply it to their own code. SO is not a coding service, but a resource for knowledge. Also, high quality, complete answers are more likely to be upvoted. These features, along with the requirement that all posts are self-contained, are some of the strengths of SO as a platform differentiates it from forums. You can edit to add additional info &/or to supplement your explanations with source documentation.
  • Bradon
    Bradon almost 3 years
    If there are NaN values in the column, pd.to_numeric will convert the dtype to float not int because NaN is considered a float.
  • wfgeo
    wfgeo almost 3 years
    It may be obvious to some but it I think it is still worth noting that you can use any Int (e.g. Int16, Int32) and indeed probably should if the dataframe is very large to save memory.
  • Newbielp
    Newbielp over 2 years
    @jezrael, in which cases this does not work...? It does not work for me and I have failed to find a generic solution.
  • jezrael
    jezrael over 2 years
    @Newbielp - First idea is test if nan are missing values or strings 'nan'
  • Newbielp
    Newbielp over 2 years
    @jezrael, it gives me type float for specific element. Whole column is type object.
  • zelusp
    zelusp over 2 years
    this approach can add a lot of memory overhead, especially on larger dataframes
  • BERA
    BERA almost 2 years
    I'm getting TypeError: cannot safely cast non-equivalent float64 to int64
  • PatrickT
    PatrickT over 1 year
  • PatrickT
    PatrickT over 1 year
    This produces a column of strings!! For a solution with current versions of pandas, see stackoverflow.com/questions/58029359/…
  • Henrique Brisola
    Henrique Brisola over 1 year
    @cs95 I am getting erro object cannot be converted to an IntegerDtype
  • robertspierre
    robertspierre over 1 year
    @Zhubarb please consider changing your accepted answer in light of this development
  • buhtz
    buhtz over 1 year
    Please explain your solution.
  • Jane Kathambi
    Jane Kathambi over 1 year
    Thank you @Abhishek Bhatia this worked for me.
  • Jane Kathambi
    Jane Kathambi over 1 year
    Works but I think replacing NaN with 0 changes the meaning of the data.
  • bsauce
    bsauce over 1 year
    caution with this approach... if any of your data really is -1, it will be overwritten.