Building multi-regression model throws error: `Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).`


Solution 1

If X is your dataframe, try using the .astype method to convert to float when running the model:

est = sm.OLS(y, X.astype(float)).fit()
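Before casting, it can help to run the check that the error message itself suggests and see which columns are responsible. The following is a minimal sketch using only pandas and NumPy; the frame and its column names (`x1`, `flag`) are hypothetical and not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "flag" holds numbers that were stored as strings
X = pd.DataFrame({"x1": [1.0, 2.5, 3.0], "flag": ["0", "1", "1"]})

# The check suggested by the error message: a mixed frame becomes an object array
print(np.asarray(X).dtype)

# List the columns that are not numeric and therefore force the object dtype
non_numeric = X.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Casting the whole frame to float gives a plain float array,
# which is what would then be passed on, e.g. sm.OLS(y, X_float)
X_float = X.astype(float)
print(np.asarray(X_float).dtype)
```

Note that `.astype(float)` only works when every value can be parsed as a number; real text labels such as "Good" or "Yes" raise a `ValueError` and need dummy encoding instead.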

Solution 2

If both y (the dependent variable) and X are taken from a DataFrame, cast both:

est = sm.OLS(y.astype(float), X.astype(float)).fit()

Solution 3

As Mário and Daniel suggested, yes, the issue is due to categorical values not previously converted into dummy variables.

I ran into this issue while working through the linear regression lab in the StatLearning book with the "Carseats" dataset from statsmodels, where the columns 'ShelveLoc', 'US' and 'Urban' are categorical. I assume the categorical values causing issues in your dataset are also strings, as they are in this one, so I will use it as an example since you didn't provide dataframes in the question.

These are the columns we start with. As stated before, 'ShelveLoc', 'US' and 'Urban' are categorical:

Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'ShelveLoc', 'Age', 'Education', 'Urban', 'US'],
      dtype='object')

In a single line of Python, I converted them to dummy variables and dropped the first level of each, which here means the "No" and "Bad" labels (as the lab in the book requests).

carseats = pd.get_dummies(carseats, columns=['ShelveLoc', 'US', 'Urban'], drop_first = True)

This will return a dataframe with the following columns:

Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'Age', 'Education', 'ShelveLoc_Good', 'ShelveLoc_Medium', 'US_Yes',
       'Urban_Yes'],
      dtype='object')

And that's it, you have dummy variables ready for OLS. Hope this is useful.
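One caveat worth checking on your own setup: as far as I know, pandas 2.0 and later return `bool` columns from `get_dummies` by default, and mixing `bool` with numeric columns can again produce an object array. Passing `dtype=float` avoids that. Below is a minimal sketch on a hypothetical three-row stand-in for the Carseats data (the values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Carseats with two categorical string columns
carseats = pd.DataFrame({
    "Sales": [9.5, 11.22, 10.06],
    "Price": [120, 83, 80],
    "ShelveLoc": ["Bad", "Good", "Medium"],
    "Urban": ["Yes", "Yes", "No"],
})

# drop_first=True drops the first level of each column ("Bad" and "No");
# dtype=float makes the dummy columns floats instead of bool/uint8
dummies = pd.get_dummies(
    carseats, columns=["ShelveLoc", "Urban"], drop_first=True, dtype=float
)
print(dummies.columns.tolist())

# Predictors only, cast to float so the design matrix is a plain float array
X = dummies.drop(columns="Sales").astype(float)
print(np.asarray(X).dtype)
```

With `X` in this form, `sm.OLS(carseats["Sales"], X)` should no longer hit the object-dtype cast.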

Solution 4

This happens because you have NOT applied the dummy-variable step to all predictors, so how can the regression run over string literals? That is what the error message is saying: it is trying to convert the data to valid numpy entries and cannot.

Just go back to your pipeline and include the dummies properly.

Author by

Sanoj

Updated on January 31, 2021

Comments

  • Sanoj
    Sanoj over 3 years

    I have a pandas DataFrame with some categorical predictors (i.e. variables) coded as 0 and 1, and some numeric variables. When I fit it with statsmodels like this:

    est = sm.OLS(y, X).fit()
    

    It throws:

    Pandas data cast to numpy dtype of object. Check input data with np.asarray(data). 
    

    I converted all the dtypes of the DataFrame using df.convert_objects(convert_numeric=True)

    After this all dtypes of dataframe variables appear as int32 or int64. But at the end it still shows dtype: object, like this:

    4516        int32
    4523        int32
    4525        int32
    4531        int32
    4533        int32
    4542        int32
    4562        int32
    sex         int64
    race        int64
    dispstd     int64
    age_days    int64
    dtype: object
    

    Here 4516, 4523 are variable labels.

    Any idea? I need to build a multi-regression model on more than hundreds of variables. For that I have concatenated 3 pandas DataFrames to come up with final DataFrame to be used in model building.

  • kiradotee
    kiradotee over 6 years
    so .. converting categorical variables to floats?
  • Daniel Gibson
    Daniel Gibson over 6 years
    all categorical variables should be converted into dummy variables before sticking them in the model, so yes
  • PatrickT
    PatrickT over 2 years
    And integers are not good enough, they must be floats! Int64 produces the same error as object or category ... sigh.
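
A note on the `dtype: object` line at the end of the `dtypes` output in the question: that line describes the Series returned by `df.dtypes` itself (a Series whose values are dtype objects), not the data, so it appears even when every column is numeric. The sketch below shows this on a hypothetical frame, and uses `pd.to_numeric` in place of `convert_objects`, which has been removed from current pandas.

```python
import pandas as pd

# Hypothetical frame: "race" holds numbers that were stored as strings
df = pd.DataFrame({"sex": [0, 1], "race": ["1", "2"], "age_days": [100, 200]})

# Columns that would force an object array when passed to the model
print(df.select_dtypes(exclude="number").columns.tolist())

# Convert every column to a numeric dtype; unparseable values become NaN
df = df.apply(pd.to_numeric, errors="coerce")

dtypes = df.dtypes
print(dtypes)        # per-column dtypes, followed by a "dtype: object" line
print(dtypes.dtype)  # the dtype of the dtypes Series itself
```

After concatenating several DataFrames, the `select_dtypes` line is a quick way to find the columns that still need converting or dummy encoding.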