NumPy or Pandas: Keeping array type as integer while having a NaN value
Solution 1
This capability has been added to pandas (beginning with version 0.24): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support
At this point, it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lowercase).
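A minimal sketch of the difference (the values here are illustrative):

```python
import numpy as np
import pandas as pd

# Default behavior: a NaN forces the whole series to float64.
s_float = pd.Series([1, 2, np.nan])

# Nullable extension dtype: capital-I "Int64" keeps integer values
# and represents the missing entry as NA.
s_int = pd.Series([1, 2, np.nan], dtype="Int64")

print(s_float.dtype)  # float64
print(s_int.dtype)    # Int64
```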
Solution 2
NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
(Update: this feature was added beginning with version 0.24 of pandas, but note it requires the use of the extension dtype Int64 (capitalized), rather than the default dtype int64 (lowercase): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support )
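The underlying NumPy limitation can be demonstrated directly: assigning NaN into an int64 array raises an error, since NaN is a floating-point value with no integer representation (a minimal sketch):

```python
import numpy as np

a = np.arange(3, dtype=np.int64)

# NaN is a floating-point value; an int64 array has no slot for it.
raised = False
try:
    a[0] = np.nan
except ValueError:
    raised = True

print(raised)  # True
```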
Solution 3
If performance is not the main issue, you can store strings instead.
df.col = df.col.dropna().apply(lambda x: str(int(x)))
Then you can mix them with NaN as much as you want. If you really want to have integers, depending on your application, you can use -1, or 0, or 1234567890, or some other dedicated value to represent NaN.
You can also temporarily duplicate the columns: one as you have it, with floats; the other one experimental, with ints or strings. Then insert asserts in every reasonable place checking that the two are in sync. After enough testing you can let go of the floats.
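A sketch of both ideas, with an illustrative column name (the sentinel -1 is just one possible choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1.0, 2.0, np.nan]})

# Strings: rows that were NaN stay NaN because assignment aligns on the index.
df["col_str"] = df["col"].dropna().apply(lambda x: str(int(x)))

# Sentinel: a dedicated value (-1 here) stands in for NaN in a true int column.
df["col_sentinel"] = df["col"].fillna(-1).astype(int)

# While both columns exist, assert that they stay in sync.
assert (df["col_sentinel"].eq(-1) == df["col"].isna()).all()
```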
Solution 4
This is not a solution for all cases, but for mine (genomic coordinates) I've resorted to using 0 as NaN:
a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)
This at least allows the proper 'native' column type to be used, and operations like subtraction and comparison work as expected.
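For example, with a hypothetical coordinate column (this assumes 0 never occurs as a real value in the data):

```python
import numpy as np
import pandas as pd

a3 = pd.DataFrame({"MapInfo": [12345.0, np.nan, 67890.0]})

# 0 stands in for NaN; the column becomes a native integer dtype.
a3["MapInfo"] = a3["MapInfo"].fillna(0).astype(int)

# Native integer operations now work as expected.
diff = a3["MapInfo"] - 1
```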
Solution 5
In case you are trying to convert a float vector (e.g. 1.143) to integer (1), and that vector has NAs, converting it to the new 'Int64' dtype will give you an error. To solve this, round the numbers first and then do .astype('Int64'):
s1 = pd.Series([1.434, 2.343, np.nan])

# Without round(), the next line raises an error:
# "cannot safely cast non-equivalent float64 to int64"
s1.astype('Int64')

# With round() it works:
s1.round().astype('Int64')
0      1
1      2
2    NaN
dtype: Int64
My use case is that I have a float series that I want to round to int; but .round() alone still leaves a float dtype with decimals, so you need to convert to an integer dtype to remove them.
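A runnable version of the above, catching the failed cast (the exact exception type may vary across pandas versions, so both TypeError and ValueError are caught here):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.434, 2.343, np.nan])

# Without round(), the cast is not value-preserving and pandas refuses it.
try:
    s1.astype("Int64")
    cast_failed = False
except (TypeError, ValueError):
    cast_failed = True

# Rounding first makes the values exact integers, so the cast succeeds.
s2 = s1.round().astype("Int64")
print(cast_failed)  # True
```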
Comments
-
ely almost 2 years
Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?

In particular, I am converting an in-house data structure to a Pandas DataFrame. In our structure, we have integer-type columns that still have NaN's (but the dtype of the column is int). It seems to recast everything as a float if we make this a DataFrame, but we'd really like these columns to be int. Thoughts?

Things tried:
I tried using the from_records() function under pandas.DataFrame, with coerce_float=False, and this did not help. I also tried using NumPy masked arrays, with NaN fill_value, which also did not work. All of these caused the column data type to become a float.
-
mgilson almost 12 years Could you use a numpy masked array?
-
ely almost 12 years I'll give it a try. I also tried the from_records function under pandas.DataFrame, with coerce_float=False, but no luck... it still makes the new data have type float64.
-
ely almost 12 years Yeah, no luck. Even with a masked array, it still converts to float. It's looking like Pandas goes like this: "Is there a NaN anywhere? ... Then everything's a float." Hopefully there is a way around this.
-
mork over 5 years Optional Nullable Integer Support is now officially added in pandas 0.24.0 - finally :) - please find an updated answer below. pandas 0.24.x release notes
-
-
Carst almost 11 years Hi Wes, is there any update on this? We run into issues where join columns are converted into either ints or floats, based on the existence of an NA value in the original list (creating issues later on when trying to merge these dataframes).
-
techvslife over 5 years Updated link: pandas-docs.github.io/pandas-docs-travis/whatsnew/…
-
Jean Paul over 5 years For now you have to specify the special dtype 'Int64' to make it work. It will be even better when it is enabled by default.
-
Alaa M. about 5 years This is great! There's a small issue though: PyCharm fails to display the dataframe in the debug window if used this way. You can see my answer to another question for how to force displaying it: stackoverflow.com/questions/38956660/… (the original problem there is different, but the solution for displaying the dataframe works)
-
Superdooperhero over 4 years Do I have to use 'Int64' or is there something like 'Int8'? It uses an insane amount of memory compared to np.float.
-
Superdooperhero over 4 years 'Int8' seems to work, but np.float still seems to load way faster. The issue seems to be that it isn't releasing memory in between. I assume the garbage collector will eventually run.
-
Alexander Santos almost 2 years For future seekers, I was receiving errors with this approach. Then I noticed there was a difference in the case of the integer. Note that Int64 != int64. Hope it helps someone.