NumPy or Pandas: Keeping array type as integer while having a NaN value


Solution 1

This capability has been added to pandas (beginning with version 0.24): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support

At this point, it requires the use of the extension dtype Int64 (capitalized), rather than the default dtype int64 (lowercase).
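
A minimal sketch of the nullable dtype in action (in current pandas the missing entry prints as <NA>; 0.24-era versions printed NaN):

import numpy as np
import pandas as pd

# The capitalized 'Int64' extension dtype can hold missing values;
# the default lowercase 'int64' cannot.
s = pd.Series([1, 2, np.nan], dtype='Int64')
print(s.dtype)  # Int64
print(s)
# 0       1
# 1       2
# 2    <NA>
# dtype: Int64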

Solution 2

NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it seems it will be at least 6 months to a year before NumPy gets these features:

http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na

(This feature has since been added, beginning with version 0.24 of pandas, but note it requires the use of the extension dtype Int64 (capitalized), rather than the default dtype int64 (lowercase): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support )
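
A quick sketch of the limitation itself: with the classic dtypes, assigning NaN into an int64 column silently upcasts it to float64 (recent pandas versions warn about this silent upcast):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
print(df['a'].dtype)     # int64

df.loc[1, 'a'] = np.nan  # introducing a missing value forces an upcast
print(df['a'].dtype)     # float64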

Solution 3

If performance is not the main issue, you can store strings instead.

df['col'] = df['col'].dropna().apply(lambda x: str(int(x)))  # NaN rows stay NaN via index alignment

Then you can mix them with NaN as much as you want. If you really want to have integers, depending on your application, you can use -1, or 0, or 1234567890, or some other dedicated value to represent NaN.
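
A one-line sketch of the sentinel route, assuming -1 can never occur as a real value in this (hypothetical) column:

df['col'] = df['col'].fillna(-1).astype(int)  # -1 stands in for NaN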

You can also temporarily duplicate the columns: one as you have, with floats; the other one experimental, with ints or strings. Then insert asserts in every reasonable place to check that the two stay in sync. After enough testing you can let go of the floats.
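
A rough sketch of that shadow-column idea (the column names are made up for illustration):

# Keep the original float column, plus an experimental string version.
df['col_str'] = df['col'].dropna().apply(lambda x: str(int(x)))

# At every reasonable checkpoint, assert the two stay in sync on non-NaN rows.
mask = df['col'].notna()
assert (df.loc[mask, 'col'].astype(int).astype(str) == df.loc[mask, 'col_str']).all()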

Solution 4

This is not a solution for all cases, but in mine (genomic coordinates) I've resorted to using 0 as NaN:

a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)

This at least allows for the proper 'native' column type to be used; operations like subtraction and comparison work as expected.
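
If you later need the missing values back, the mapping can be reversed, assuming 0 never occurs as a real coordinate; note the column reverts to float the moment NaN re-enters it:

a3['MapInfo'] = a3['MapInfo'].replace(0, np.nan)  # column becomes float64 again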

Solution 5

If you are trying to convert a float vector (e.g. 1.143) to integer (1), and that vector has NAs, converting it directly to the new 'Int64' dtype will give you an error. To solve this, round the numbers first and then do .astype('Int64'):

import numpy as np
import pandas as pd

s1 = pd.Series([1.434, 2.343, np.nan])

# Without round(), the next line raises an error:
# "cannot safely cast non-equivalent float64 to int64"
# s1.astype('Int64')

# With round() it works:
s1.round().astype('Int64')
0      1
1      2
2    NaN
dtype: Int64

My use case is that I have a float series I want to round to int, but .round() alone still leaves a float with decimals, so you need to convert to int to actually remove them.


Comments

  • ely (almost 2 years)

    Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?

    In particular, I am converting an in-house data structure to a Pandas DataFrame. In our structure, we have integer-type columns that still have NaN's (but the dtype of the column is int). It seems to recast everything as a float if we make this a DataFrame, but we'd really like these columns to remain int.

    Thoughts?

    Things tried:

    I tried using the from_records() function under pandas.DataFrame, with coerce_float=False, and this did not help. I also tried using NumPy masked arrays, with NaN fill_value, which also did not work. All of these caused the column data type to become a float. (A minimal sketch reproducing the masked-array behavior appears after the comments below.)

    • mgilson (almost 12 years)
      Could you use a numpy masked array?
    • ely (almost 12 years)
      I'll give it a try. I also tried the from_records function under pandas.DataFrame, with coerce_float=False, but no luck... it still makes the new data have type float64.
    • ely (almost 12 years)
      Yeah, no luck. Even with masked array, it still converts to float. It's looking like Pandas goes like this: "Is there a NaN anywhere? ... Then everything's a float." Hopefully there is a way around this.
    • mork (over 5 years)
      Optional Nullable Integer Support is now officially added in pandas 0.24.0 - finally :) - please find an updated answer below. pandas 0.24.x release notes
  • Carst (almost 11 years)
    Hi Wes, is there any update on this? We run into issues where join columns are converted into either ints or floats, depending on the existence of an NA value in the original list. (This creates issues later on when trying to merge these dataframes.)
  • Jean Paul (over 5 years)
    For now you have to specify a special dtype like 'Int64' to make it work. It will be even better when it is enabled by default.
  • Alaa M. (about 5 years)
    This is great! There's a small issue, though: PyCharm fails to display the dataframe in the debug window when it is used this way. See my answer to another question for how to force it to display: stackoverflow.com/questions/38956660/… (the original problem there is different, but the solution for displaying the dataframe works)
  • Superdooperhero (over 4 years)
    Do I have to use 'Int64' or is there something like 'Int8'? It uses an insane amount of memory compared to np.float.
  • Superdooperhero (over 4 years)
    'Int8' seems to work, but np.float still seems to load way faster. The issue seems to be that it isn't releasing memory in between. Assume the garbage collector will eventually run.
  • Alexander Santos (almost 2 years)
    For future seekers, I was receiving errors with this approach. Then I noticed there was a difference in the casing of the integer dtype. Note that Int64 != int64. Hope it helps someone.
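
To make the failure described in the question concrete, here is a minimal sketch (not from the original posts) showing that even a NumPy masked integer array comes out of pandas as float64:

import numpy as np
import pandas as pd

# An integer array with one masked (missing) entry.
arr = np.ma.masked_array([1, 2, 3], mask=[False, True, False])

df = pd.DataFrame({'a': arr})
print(df['a'].dtype)  # float64 -- the masked slot becomes NaN and the column upcasts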