Using read_excel with converters for reading Excel file into Pandas DataFrame results in a numeric column of object type

10,556

Solution 1

Let's remove the converters argument for a moment -

c = ['Energy Supply', 'Energy Supply per Capita', '% Renewable']
df = pd.read_excel("Energy Indicators.xls", 
                   skiprows=17, 
                   skip_footer=38, 
                   usecols=[2,3,4,5], 
                   na_values=['...'], 
                   names=c,
                   index_col=[0])

df.index.name = 'Country'
df.head()    
                Energy Supply  Energy Supply per Capita  % Renewable
Country                                                             
Afghanistan             321.0                      10.0    78.669280
Albania                 102.0                      35.0   100.000000
Algeria                1959.0                      51.0     0.551010
American Samoa            NaN                       NaN     0.641026
Andorra                   9.0                     121.0    88.695650

df.dtypes

Energy Supply               float64
Energy Supply per Capita    float64
% Renewable                 float64
dtype: object

Your data loads just fine without a converter. There's a trick to understanding why this happens.

By default, pandas will read in the column and try to "interpret" your data. By specifying your own converter, you override pandas conversion, so this does not happen.

pandas passes integer and string values to convert_energy, so the isinstance(energy, float) is never evaluated to True. Instead, the else runs, and these values are returned as is, so your resultant column is a mixture of strings and integers. If you put a print(type(energy)) inside your function, this becomes obvious.

Since you have mixtures of types, the resultant type is object. However, if you do not use a converter, pandas will attempt to interpret your data, and will successfully parse it to numeric.

So, just doing -

df['Energy Supply'] *= 1000000

Would be more than enough.

Solution 2

One of the values for energy in your excel file is a string "..." and when in your coverter function, you just return energy as is if it is a string datatype.

Therefore you are getting a string returned along with your numbers which then changes the dtype of you column to 'object.

You could try something like this:

def convert_energy(energy):
    if energy == "...":
        return np.nan
    elif isinstance(energy, float):
        return float(energy*1000000)
    else:
        return float(energy)

df = pd.read_excel('http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls', 
                   skiprows=17, skip_footer=38, 
                   usecols=[2,3,4,5], na_values=['...'], 
                   names=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'],
                   converters={1: convert_energy}).set_index('Country')

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 227 entries, Afghanistan to Zimbabwe
Data columns (total 3 columns):
Energy Supply               222 non-null float64
Energy Supply per Capita    222 non-null float64
% Renewable                 227 non-null float64
dtypes: float64(3)
memory usage: 6.2+ KB
Share:
10,556
Krzysztof Słowiński
Author by

Krzysztof Słowiński

Dedicated software problem solver inclined towards mathematics.

Updated on June 11, 2022

Comments

  • Krzysztof Słowiński
    Krzysztof Słowiński almost 2 years

    I am reading this Excel file United Nations Energy Indicators using the code snippet here:

    def convert_energy(energy):
        if isinstance(energy, float):
            return energy*1000000
        else:
            return energy
    
    def energy_df():
        return pd.read_excel("Energy Indicators.xls", skiprows=17, skip_footer=38, usecols=[2,3,4,5], na_values=['...'], names=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'], converters={1: convert_energy}).set_index('Country')
    

    This results in Energy Supply column having the object type instead of float. Why is it the case?

    energy = energy_df()
    print(energy.dtypes)
    
    Energy Supply                object
    Energy Supply per Capita    float64
    % Renewable                 float64
    
    • cs95
      cs95 over 6 years
      A guess is that there are non-numeric entries, triggering the else condition to run. Why not return np.nan in the else bit?
    • Jon Clements
      Jon Clements over 6 years
      Why use a converter at all? Just load it, see if pandas gets a numeric type and multiple it, if it doesn't, overwrite the column with a coerced pd.to_numeric * 1000000 ?
    • Krzysztof Słowiński
      Krzysztof Słowiński over 6 years
      It is true, if I do not use a converter, the type is float64. That is a conversion that has to be done on the column, and I thought that converter is a good way of doing it.
    • ayhan
      ayhan over 6 years
      isinstance(1, float) returns False by the way.
  • Krzysztof Słowiński
    Krzysztof Słowiński over 6 years
    And using na_values=['...'] seems to not take effect?
  • cs95
    cs95 over 6 years
    Needed to edit c too, but index_col=[0] worked out eventually. Just needed to manually set the name later, but now it should be tonnes more efficient than before.
  • Jon Clements
    Jon Clements over 6 years
    Ahhh... right... you can use post naming as long as it's listed... so put Country back in the names list and then put index_col='Country' seems to work for me...
  • Scott Boston
    Scott Boston over 6 years
    Since you have only one na_value, let's try without the square brackets making it a list.
  • cs95
    cs95 over 6 years
    @JonClements Really? That was another option I tried, and it ended up giving me a ValueError: Index Country invalid, even after putting "Country" back into c, at the start. I'm running on 0.21. What about you?
  • Jon Clements
    Jon Clements over 6 years
    Yeah... 0.21.1 - mind you - I'm working on a mocked up DF - lemme try the actual thing.
  • cs95
    cs95 over 6 years
  • Jon Clements
    Jon Clements over 6 years
    shrugs - I'll think about it later - don't have time now - thanks though :)
  • cs95
    cs95 over 6 years
    It's a minor detail. Don't worry about it!
  • Timus
    Timus over 3 years
    Welcome to SO! When you are about to answer an old question (this one is almost 3 years old) that already has an accepted answer (that is the case here) please ask yourself: Do I really have a substantial improvement to offer? If not, consider refraining from answering.