When to apply(pd.to_numeric) and when to astype(np.float64) in python?

Solution 1

If the column already has a numeric dtype (int8/16/32/64, float64, bool), you can convert it to another numeric dtype with the pandas .astype() method.

Demo:

In [90]: df = pd.DataFrame(np.random.randint(10**5,10**7,(5,3)),columns=list('abc'), dtype=np.int64)

In [91]: df
Out[91]:
         a        b        c
0  9059440  9590567  2076918
1  5861102  4566089  1947323
2  6636568   162770  2487991
3  6794572  5236903  5628779
4   470121  4044395  4546794

In [92]: df.dtypes
Out[92]:
a    int64
b    int64
c    int64
dtype: object

In [93]: df['a'] = df['a'].astype(float)

In [94]: df.dtypes
Out[94]:
a    float64
b      int64
c      int64
dtype: object
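
To put the two calls side by side, here is a minimal sketch. The Series `s` is made up for illustration and only stands in for an int64 column like the one in the question. It shows that pd.to_numeric() hands back an already-numeric column with its dtype unchanged, while .astype() performs the explicit cast:

```python
import numpy as np
import pandas as pd

# Hypothetical integer column, standing in for something like xiv['Volume'].
s = pd.Series([252000, 484000, 62000], dtype=np.int64)

# to_numeric() is meant for parsing non-numeric input; int64 is already
# numeric, so the dtype comes back unchanged.
unchanged = pd.to_numeric(s)

# astype() is an explicit cast, so the dtype changes as requested.
as_float = s.astype(np.float64)

print(unchanged.dtype)  # int64
print(as_float.dtype)   # float64
```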

This won't work for object (string) columns that contain values which can't be converted to numbers:

In [95]: df.loc[1, 'b'] = 'XXXXXX'

In [96]: df
Out[96]:
           a        b        c
0  9059440.0  9590567  2076918
1  5861102.0   XXXXXX  1947323
2  6636568.0   162770  2487991
3  6794572.0  5236903  5628779
4   470121.0  4044395  4546794

In [97]: df.dtypes
Out[97]:
a    float64
b     object
c      int64
dtype: object

In [98]: df['b'].astype(float)
...
skipped
...
ValueError: could not convert string to float: 'XXXXXX'

In that case, use pd.to_numeric() with errors='coerce', which turns unparseable values into NaN:

In [99]: df['b'] = pd.to_numeric(df['b'], errors='coerce')

In [100]: df
Out[100]:
           a          b        c
0  9059440.0  9590567.0  2076918
1  5861102.0        NaN  1947323
2  6636568.0   162770.0  2487991
3  6794572.0  5236903.0  5628779
4   470121.0  4044395.0  4546794

In [101]: df.dtypes
Out[101]:
a    float64
b    float64
c      int64
dtype: object
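
The behaviour above can be condensed into a small standalone sketch. The Series `raw` and `clean` are hypothetical examples, not data from the question. It contrasts the default errors='raise' with errors='coerce', and shows that the resulting dtype is float64 only when a NaN (or a float) makes it necessary:

```python
import pandas as pd

# Hypothetical column read in as strings, with one unparseable entry.
raw = pd.Series(["10", "20", "oops", "40"], dtype=object)

# Default errors='raise': the bad value triggers a ValueError.
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# errors='coerce': the bad value becomes NaN, which forces float64.
coerced = pd.to_numeric(raw, errors="coerce")

# Clean integer-like strings are parsed to an integer dtype instead.
clean = pd.to_numeric(pd.Series(["1", "2", "3"], dtype=object))

print(raised)         # True
print(coerced.dtype)  # float64
print(clean.dtype)    # int64
```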

Solution 2

I don't have a technical explanation for this, but I have noticed that pd.to_numeric() raises the following error when converting the string 'nan':

In [10]: df = pd.DataFrame({'value': 'nan'}, index=[0])

In [11]: pd.to_numeric(df.value)

Traceback (most recent call last):

  File "<ipython-input-11-98729d13e45c>", line 1, in <module>
    pd.to_numeric(df.value)

  File "C:\Users\joshua.lee\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\tools\numeric.py", line 133, in to_numeric
    coerce_numeric=coerce_numeric)

  File "pandas/_libs/src\inference.pyx", line 1185, in pandas._libs.lib.maybe_convert_numeric

ValueError: Unable to parse string "nan" at position 0

whereas astype(float) does not:

In [12]: df.value.astype(float)
Out[12]:
0   NaN
Name: value, dtype: float64
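
A likely reason the astype(float) call succeeds is that Python's own float() accepts the string 'nan'. The sketch below only demonstrates that side; the Series `s` is a made-up example, and whether pd.to_numeric() rejects 'nan' may depend on the pandas version, so that call is deliberately left out:

```python
import math

import pandas as pd

# Python's built-in float() accepts the string 'nan'.
value = float("nan")
print(math.isnan(value))  # True

# astype(float) on an object column of strings gives the same result.
s = pd.Series(["nan", "1.5"], dtype=object)
converted = s.astype(float)
print(converted.isna().tolist())  # [True, False]
```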

Solution 3

You can use this:

pd.to_numeric(df.value, errors='coerce').fillna(0, downcast='infer')  

Values that cannot be parsed become NaN, and fillna() then replaces them with zero.
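
As far as I know, newer pandas releases deprecate the downcast argument of fillna(), so here is a sketch of a variant that avoids it. It assumes you want an int64 result and that every parsed value is a whole number; the Series `s` is hypothetical:

```python
import pandas as pd

# Hypothetical column with one value that cannot be parsed.
s = pd.Series(["1", "2", "bad"], dtype=object)

# Unparseable entries become NaN, NaN becomes 0, then cast explicitly.
result = pd.to_numeric(s, errors="coerce").fillna(0).astype("int64")

print(result.tolist())  # [1, 2, 0]
print(result.dtype)     # int64
```

Keep in mind that after this step a genuine 0 in the data can no longer be told apart from a value that failed to parse.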

Solution 4

I observed that I was able to convert object (str) to float first and then float to int64.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10**5, 10**7, (5, 3)), columns=list('abc'),
                  dtype=np.int64)
df['a'] = df['a'].astype('str')    # int64 -> str
df.dtypes

df['a'] = df['a'].astype('float')  # str -> float64
df['a'] = df['a'].astype('int64')  # float64 -> int64

Worked fine.
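
One caveat with this two-step cast, shown as a minimal sketch with made-up Series (`s` and `with_nan` are not from the question): it works for clean numeric strings, but the final cast to int64 is rejected as soon as the float column contains a NaN.

```python
import numpy as np
import pandas as pd

# Numeric strings: str -> float -> int64 works, as described above.
s = pd.Series(["100", "250"], dtype=object)
as_int = s.astype("float").astype("int64")
print(as_int.tolist())  # [100, 250]

# Once a NaN is present, the final cast to int64 is rejected.
with_nan = pd.Series([1.0, np.nan])
try:
    with_nan.astype("int64")
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```

Also note that casting float to int64 truncates any fractional part, so this route is only safe when the values are whole numbers.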

Author by

d8aninja

Americanadian dad(bod). Him/they. Star Wars/Trek. Whiskey/whiskeys. 7y @USArmy Vet. Trying to be good at devsecops.

Updated on July 09, 2022

Comments

  • d8aninja
    d8aninja almost 2 years

    I have a pandas DataFrame object named xiv which has a column of int64 Volume measurements.

    In[]: xiv['Volume'].head(5)
    Out[]: 
    
    0    252000
    1    484000
    2     62000
    3    168000
    4    232000
    Name: Volume, dtype: int64
    

    I have read other posts (like this and this) that suggest the following solutions. But when I use either approach, it doesn't appear to change the dtype of the underlying data:

    In[]: xiv['Volume'] = pd.to_numeric(xiv['Volume'])
    
    In[]: xiv['Volume'].dtypes
    Out[]: 
    dtype('int64')
    

    Or...

    In[]: xiv['Volume'] = pd.to_numeric(xiv['Volume'])
    Out[]: ###omitted for brevity###
    
    In[]: xiv['Volume'].dtypes
    Out[]: 
    dtype('int64')
    
    In[]: xiv['Volume'] = xiv['Volume'].apply(pd.to_numeric)
    
    In[]: xiv['Volume'].dtypes
    Out[]: 
    dtype('int64')
    

    I've also tried making a separate pandas Series, using the methods listed above on that Series, and reassigning it to the xiv['Volume'] object, which is a pandas.core.series.Series object.

    I have, however, found a solution to this problem using the numpy package's float64 type - this works but I don't know why it's different.

    In[]: xiv['Volume'] = xiv['Volume'].astype(np.float64)
    
    In[]: xiv['Volume'].dtypes
    Out[]: 
    dtype('float64') 
    

    Can someone explain how to accomplish with the pandas library what the numpy library seems to do easily with its float64 class, that is, convert the column in the xiv DataFrame to float64 in place?

    • MaxU - stop genocide of UA
      MaxU - stop genocide of UA over 7 years
      int64 is already "numeric" dtype. to_numeric() should help to convert strings into numeric dtypes...
    • d8aninja
      d8aninja over 7 years
      the cited post shows the dtype returned by calling to_numeric will be float64...
    • MaxU - stop genocide of UA
      MaxU - stop genocide of UA over 7 years
      Check this: pd.to_numeric(pd.Series(['1','2','3'])).dtype. It'll be float64 only if necessary: 1. there are NaNs or non-convertible values in the Series; 2. there are floats in the Series.
    • d8aninja
      d8aninja over 7 years
      Understand that this is producing the problem I gave, but how does it address the question of why the numpy solution works instead?
    • MaxU - stop genocide of UA
      MaxU - stop genocide of UA over 7 years
      What is the "problem" and what is/are your goal(s)? BTW pd.Series.astype(np.float64) - is a Pandas method
    • d8aninja
      d8aninja over 7 years
      @MaxU check my edit at the bottom?
    • MaxU - stop genocide of UA
      MaxU - stop genocide of UA over 7 years
      I've added some demos - I hope it's a bit clearer now...
  • OnlySalman
    OnlySalman over 5 years
    Just a question: why did you write df.b, not df['b']?
  • Punit
    Punit over 5 years
    @SalmanALharbi: it's always better to use the df['column name'] format. If the column name doesn't contain a space, you can also access it directly as df.columnname, but df['column name'] is the form that always works.
  • Guzman Ojero
    Guzman Ojero over 2 years
    @OnlySalman it's just a different and valid syntax. Some people prefer df['column_name'] and others prefer df.column_name. With df.column_name, the column name must be a valid Python identifier. For example, if the column name starts with a number, you can't use dot notation: you can't do df.1, you have to do df[1].
  • Green Noob
    Green Noob about 2 years
    My guess - it's because this is what Python does when we try to convert the string "nan" to a float using float("nan"). The result is a float that represents a quantity that is not a number.