When to apply(pd.to_numeric) and when to astype(np.float64) in Python?
Solution 1
If you already have numeric dtypes (int8|16|32|64, float64, boolean) you can convert them to another "numeric" dtype using the Pandas .astype() method.
Demo:
In [90]: df = pd.DataFrame(np.random.randint(10**5,10**7,(5,3)),columns=list('abc'), dtype=np.int64)
In [91]: df
Out[91]:
a b c
0 9059440 9590567 2076918
1 5861102 4566089 1947323
2 6636568 162770 2487991
3 6794572 5236903 5628779
4 470121 4044395 4546794
In [92]: df.dtypes
Out[92]:
a int64
b int64
c int64
dtype: object
In [93]: df['a'] = df['a'].astype(float)
In [94]: df.dtypes
Out[94]:
a float64
b int64
c int64
dtype: object
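To make the difference concrete, here is a minimal sketch (the sample values and variable names are only illustrative) showing that pd.to_numeric() leaves an already-numeric column alone and only picks float64 when the parsed values need it, whereas .astype() converts to whatever dtype you ask for:

```python
import pandas as pd

# Illustrative data: a column that is already int64.
volume = pd.Series([252000, 484000, 62000], dtype="int64")

unchanged = pd.to_numeric(volume)      # nothing to parse, dtype is left alone
as_float = volume.astype("float64")    # explicit request for float64

# With strings, to_numeric infers the narrowest fitting numeric dtype.
from_int_strings = pd.to_numeric(pd.Series(["1", "2", "3"]))
from_mixed_strings = pd.to_numeric(pd.Series(["1", "2.5"]))

print(unchanged.dtype, as_float.dtype)
print(from_int_strings.dtype, from_mixed_strings.dtype)
```

So when the column is already numeric and you want a specific dtype, .astype() is the tool to reach for.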
It won't work for object (string) dtypes that can't be converted to numbers:
In [95]: df.loc[1, 'b'] = 'XXXXXX'
In [96]: df
Out[96]:
a b c
0 9059440.0 9590567 2076918
1 5861102.0 XXXXXX 1947323
2 6636568.0 162770 2487991
3 6794572.0 5236903 5628779
4 470121.0 4044395 4546794
In [97]: df.dtypes
Out[97]:
a float64
b object
c int64
dtype: object
In [98]: df['b'].astype(float)
...
skipped
...
ValueError: could not convert string to float: 'XXXXXX'
So here we want to use pd.to_numeric():
In [99]: df['b'] = pd.to_numeric(df['b'], errors='coerce')
In [100]: df
Out[100]:
a b c
0 9059440.0 9590567.0 2076918
1 5861102.0 NaN 1947323
2 6636568.0 162770.0 2487991
3 6794572.0 5236903.0 5628779
4 470121.0 4044395.0 4546794
In [101]: df.dtypes
Out[101]:
a float64
b float64
c int64
dtype: object
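Because errors='coerce' silently turns bad values into NaN, it can help to check which inputs were affected. A small sketch, assuming a made-up object column (the names raw, parsed and failed are hypothetical):

```python
import pandas as pd

# Illustrative object column: two valid numbers, one bad string, one missing value.
raw = pd.Series(["9590567", "XXXXXX", "162770", None], dtype="object")

parsed = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the input but became NaN could not be parsed.
failed = raw[parsed.isna() & raw.notna()]

print(parsed.dtype)
print(failed.tolist())
```

Comparing the NaN mask with the original missing-value mask separates values that were already missing from values that failed to parse.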
Solution 2
I don't have a technical explanation for this, but I have noticed that pd.to_numeric() raises the following error when converting the string 'nan':
In [10]: df = pd.DataFrame({'value': 'nan'}, index=[0])
In [11]: pd.to_numeric(df.value)
Traceback (most recent call last):
File "<ipython-input-11-98729d13e45c>", line 1, in <module>
pd.to_numeric(df.value)
File "C:\Users\joshua.lee\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\tools\numeric.py", line 133, in to_numeric
coerce_numeric=coerce_numeric)
File "pandas/_libs/src\inference.pyx", line 1185, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "nan" at position 0
whereas astype(float) does not:
In [12]: df.value.astype(float)
Out[12]:
0 NaN
Name: value, dtype: float64
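A likely explanation for the astype(float) result is that Python's own float() accepts the string "nan". The sketch below checks only that part. The to_numeric behaviour reported above may depend on the pandas version, so it is deliberately left out:

```python
import math
import pandas as pd

# Plain Python already parses "nan" into a float NaN.
value = float("nan")
print(math.isnan(value))

# An object column holding the string 'nan' converts the same way.
s = pd.Series(["nan", "1.5"], dtype="object")
converted = s.astype(float)
print(converted.isna().tolist())
```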
Solution 3
You can use this:
pd.to_numeric(df.value, errors='coerce').fillna(0, downcast='infer')
It will use zero in place of NaN.
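As far as I know, recent pandas releases warn that the downcast argument of fillna() is deprecated, so an explicit cast may be the safer spelling. A minimal sketch with made-up data:

```python
import pandas as pd

# Illustrative column with one value that cannot be parsed.
value = pd.Series(["1", "not a number", "3"])

# Coerce bad values to NaN, replace them with 0, then cast explicitly.
cleaned = pd.to_numeric(value, errors="coerce").fillna(0).astype("int64")

print(cleaned.tolist())
```

Note that replacing NaN with 0 makes failed conversions indistinguishable from real zeros, so this only fits when 0 is a sensible default.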
Solution 4
I observed that I was able to convert object (str) to float first and then float to int64.
df = pd.DataFrame(np.random.randint(10**5,10**7,(5,3)),columns=list('abc'),
dtype=np.int64)
df['a'] = df['a'].astype('str')
df.dtypes
df['a'] = df['a'].astype('float')
df['a'] = df['a'].astype('int64')
Worked fine.
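The float-to-int64 step only works while there are no missing values. pandas also has a nullable integer dtype spelled "Int64" (capital I), which is distinct from NumPy's int64 and can hold missing values. A sketch with hypothetical data:

```python
import pandas as pd

# Illustrative string column with a missing entry.
a = pd.Series(["1", "2", None], dtype="object")

as_float = pd.to_numeric(a, errors="coerce")   # float64, missing value becomes NaN

try:
    as_float.astype("int64")                   # NumPy int64 cannot represent NaN
except ValueError as exc:
    print("int64 cast failed:", exc)

as_nullable_int = as_float.astype("Int64")     # nullable integer keeps the gap as <NA>
print(as_nullable_int.dtype)
```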
d8aninja
Updated on July 09, 2022
Comments
-
d8aninja almost 2 years
I have a pandas DataFrame object named xiv which has a column of int64 Volume measurements.
In[]: xiv['Volume'].head(5)
Out[]:
0 252000
1 484000
2 62000
3 168000
4 232000
Name: Volume, dtype: int64
I have read other posts (like this and this) that suggest the following solutions. But when I use either approach, it doesn't appear to change the dtype of the underlying data:
In[]: xiv['Volume'] = pd.to_numeric(xiv['Volume'])
In[]: xiv['Volume'].dtypes
Out[]: dtype('int64')
Or...
In[]: xiv['Volume'] = pd.to_numeric(xiv['Volume'])
Out[]: ###omitted for brevity###
In[]: xiv['Volume'].dtypes
Out[]: dtype('int64')
In[]: xiv['Volume'] = xiv['Volume'].apply(pd.to_numeric)
In[]: xiv['Volume'].dtypes
Out[]: dtype('int64')
I've also tried making a separate pandas Series and using the methods listed above on that Series and reassigning to the x['Volume'] object, which is a pandas.core.series.Series object.
I have, however, found a solution to this problem using the numpy package's float64 type - this works but I don't know why it's different.
In[]: xiv['Volume'] = xiv['Volume'].astype(np.float64)
In[]: xiv['Volume'].dtypes
Out[]: dtype('float64')
Can someone explain how to accomplish with the pandas library what the numpy library seems to do easily with its float64 class; that is, convert the column in the xiv DataFrame to a float64 in place.
-
MaxU - stop genocide of UA over 7 years
int64 is already a "numeric" dtype. to_numeric() should help to convert strings into numeric dtypes...
-
d8aninja over 7 years
The cited post shows the dtype returned by calling to_numeric will be float64...
-
MaxU - stop genocide of UA over 7 years
Check this: pd.to_numeric(pd.Series(['1','2','3'])).dtype. It'll be float64 only if it's necessary: 1. there is/are NaNs or non-convertible values in the Series. 2. there are floats in the Series.
-
d8aninja over 7 years
Understand that this is producing the problem I gave, but how does it address the question of why the numpy solution works instead?
-
MaxU - stop genocide of UA over 7 years
What is the "problem" and what is/are your goal(s)? BTW pd.Series.astype(np.float64) is a Pandas method.
-
d8aninja over 7 years
@MaxU check my edit at the bottom?
-
MaxU - stop genocide of UA over 7 years
I've added some demos - I hope it got a bit clearer now...
-
OnlySalman over 5 years
Just a question: why did you write df.b, not df['b']?
-
Punit over 5 years
@SalmanALharbi: it's always better to use the format df['column name'], but sometimes, if the column name doesn't contain a space, you can use the column name directly as df.columnname. Still, df['column name'] should always be used.
-
Guzman Ojero over 2 years
@OnlySalman it's just a different and valid syntax. Some people prefer df['column_name'] and others prefer df.column_name. With df.column_name the column name must be a valid Python identifier. E.g. if the column name starts with a number, you can't use dot notation. You can't do df.1, you have to do df[1].
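A short sketch of the two notations, using a made-up DataFrame. The column names are chosen only to show where dot notation stops working or resolves to something other than the column:

```python
import pandas as pd

# Illustrative frame: a plain name, a name with a space, and a name that
# clashes with an existing DataFrame method.
df = pd.DataFrame({"b": [1, 2], "my col": [3, 4], "count": [5, 6]})

same = df.b.equals(df["b"])                      # both notations reach column 'b'
spaced = df["my col"]                            # only brackets work with a space
is_method = callable(df.count)                   # df.count is the method, not the column
is_column = isinstance(df["count"], pd.Series)   # brackets return the column

print(same, is_method, is_column)
```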
-
Green Noob about 2 years
My guess - it's because this is what Python does when we try to convert the string "nan" to a float using float("nan"). The result is a float that represents a quantity that is not a number.