ValueError: X has 29 features per sample; expecting 84

Solution 1

Your new data must have exactly the same columns as the data you used to train and persist the original model. If the newer data contains fewer unique values for the categorical variables, manually add columns for the missing values after doing pd.get_dummies() and set them to zero for all the data points.

The model will only work when it gets the required number of columns. If pd.get_dummies fails to create all of those columns on the newer data, you should add them yourself, as in the sketch below.
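
A minimal sketch of that manual step (the column names below are invented for illustration; in practice use the column list your model was actually trained with, saved at training time):

    import pandas as pd

    # Hypothetical list of columns the persisted model was trained on;
    # save this list when you train (e.g. alongside the joblib file).
    training_columns = ['loanAmount', 'homeOwnership_OWN', 'homeOwnership_RENT',
                        'purpose_credit_card', 'purpose_debt_consolidation']

    # New data from the API: only a subset of the categorical values appears.
    df_new = pd.DataFrame({'loanAmount': [10000, 5000],
                           'homeOwnership': ['RENT', 'RENT'],
                           'purpose': ['credit_card', 'credit_card']})
    df_new = pd.get_dummies(df_new, columns=['homeOwnership', 'purpose'])

    # Add every training column that get_dummies could not create and set it to zero.
    for col in training_columns:
        if col not in df_new.columns:
            df_new[col] = 0

    # Reorder so the columns match the training data exactly.
    df_new = df_new[training_columns]
    print(df_new.shape)  # (2, 5): same columns as the training matrix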

Edit

If you want to automatically insert the missing columns after the pd.get_dummies() step, you can use the following approach. Assuming that df_newdata is the dataframe you get after applying pd.get_dummies() to the new dataset and df_olddata is the dataframe you got when you applied pd.get_dummies() to the older dataset (which was used for training), you can simply do this:

df_newdata = df_newdata.reindex(labels=df_olddata.columns,axis=1)

This will automatically create the missing columns in df_newdata (in comparison to df_olddata) and set their values to NaN for all the rows. Your new dataframe now has exactly the same columns as the original dataframe.
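
One caveat: most scikit-learn estimators will reject NaN at predict time, so you will probably want zeros in those added columns instead. A small variation of the same call does that via reindex's fill_value argument:

    df_newdata = df_newdata.reindex(columns=df_olddata.columns, fill_value=0)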

Hope this helps

Solution 2

Use just transform instead of fit_transform on the new data. This should do the trick. Hope it helps.

    x_scbcd = scaler.transform(dfbcd)
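
Note that this only works if scaler is the StandardScaler that was fitted on the training data, which usually means persisting it alongside the model. A minimal sketch of that pattern, with toy data and a hypothetical scaler.joblib file name:

    import pandas as pd
    from joblib import dump, load
    from sklearn.preprocessing import StandardScaler

    # Toy stand-ins for the training features and the (column-aligned) new data.
    df_train = pd.DataFrame({'loanAmount': [10000, 5000, 20000], 'grade_num': [0, 1, 2]})
    dfbcd = pd.DataFrame({'loanAmount': [7500], 'grade_num': [1]})

    # At training time: fit the scaler once and persist it next to the model.
    scaler = StandardScaler()
    x_train_scaled = scaler.fit_transform(df_train)
    dump(scaler, 'scaler.joblib')

    # At prediction time: load the already-fitted scaler and call transform only.
    scaler = load('scaler.joblib')
    x_scbcd = scaler.transform(dfbcd)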

Solution 3

Could you try using the transform method of the StandardScaler that was fitted on your training data, rather than the scaled array x_scbcd, on your new data before passing it to lrbcd_test.predict? This will create the same feature representation for your test data that the model was trained on.

ypredbcdfinal = lrbcd_test.predict(scaler.transform(dfbcd))
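
Putting the pieces together, a sketch that assumes the model, the fitted scaler, and the training-time column list were all persisted when the model was trained (the scaler.joblib and training_columns.joblib file names are hypothetical):

    from joblib import load

    # Load everything that was persisted at training time.
    lrbcd_test = load('lrbcd_test.joblib')
    scaler = load('scaler.joblib')                       # hypothetical
    training_columns = load('training_columns.joblib')   # hypothetical

    # Align the new data's dummy columns with the training columns,
    # scale with the already-fitted scaler, then predict.
    dfbcd = dfbcd.reindex(columns=training_columns, fill_value=0)
    x_scbcd = scaler.transform(dfbcd)
    ypredbcdfinal = lrbcd_test.predict(x_scbcd)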
Author: jz451

Updated on July 09, 2022

Comments

  • jz451 almost 2 years

    I am working on a script using the Lending Club API to predict whether a loan will "pay in full" or "charge off". To do this I am using scikit-learn to build the model, which is persisted using joblib. I run into a ValueError due to a difference between the number of columns in the persisted model and the number of columns in the new raw data. The ValueError is caused by creating dummy variables for categorical variables. The number of columns used in the model is 84, while in this example the number of columns produced from the new data is 29.

    The number of columns needs to be 84 for the new data when making dummy variables, but I am not sure how to proceed since only a subset of all possible values of the categorical variables 'homeOwnership', 'addrState', and 'purpose' is present when obtaining new data from the API.

    Here's the code I am testing at the moment starting at the point where the categorical variables are transformed into dummy variables and stopping at model implementation.

    #......continued
    
    # flag whether these fields are populated (assign the result back, otherwise the calls have no effect)
    df['mthsSinceLastDelinq'] = df['mthsSinceLastDelinq'].notnull().astype('int')
    df['mthsSinceLastRecord'] = df['mthsSinceLastRecord'].notnull().astype('int')
    df['grade_num'] = df['grade'].map({'A':0,'B':1,'C':2,'D':3})
    df['emp_length_num'] = df['empLength']
    df = pd.get_dummies(df,columns=['homeOwnership','addrState','purpose'])
    # df = pd.get_dummies(df,columns=['home_ownership','addr_state','verification_status','purpose'])
    
    # step 3.5 transform data before making predictions
    
    df.drop(['id','grade','empLength','isIncV'],axis=1,inplace=True)
    dfbcd = df[df['grade_num'] != 0]
    scaler = StandardScaler()
    x_scbcd = scaler.fit_transform(dfbcd)
    
    # step 4 predicting
    
    lrbcd_test = load('lrbcd_test.joblib')
    ypredbcdfinal = lrbcd_test.predict(x_scbcd)
    

    Here's the error message

    ValueError                                Traceback (most recent call last)
    <ipython-input-239-c99611b2e48a> in <module>
         11 # change name of model and file name
         12 lrbcd_test = load('lrbcd_test.joblib')
    ---> 13 ypredbcdfinal = lrbcd_test.predict(x_scbcd)
         14 
         15     #add model
    
    ~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
        287             Predicted class label per sample.
        288         """
    --> 289         scores = self.decision_function(X)
        290         if len(scores.shape) == 1:
        291             indices = (scores > 0).astype(np.int)
    
    ~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in decision_function(self, X)
        268         if X.shape[1] != n_features:
        269             raise ValueError("X has %d features per sample; expecting %d"
    --> 270                              % (X.shape[1], n_features))
        271 
        272         scores = safe_sparse_dot(X, self.coef_.T,
    
    ValueError: X has 29 features per sample; expecting 84
    
  • jz451 over 4 years
    Tried your suggestion but got the following error: AttributeError: 'numpy.ndarray' object has no attribute 'transform'. Also I tried persisting and loading the scaler used during training/testing and got this error: ValueError: operands could not be broadcast together with shapes (3,26) (84,) (3,26).
  • jz451 over 4 years
    The variables for the new data are the same when the API is called as the data used in training/testing, but pd.get_dummies/onehotencoder obviously don't insert columns for values not in the new data. Doing it manually won't work as I'm trying to automate the process. What I don't know is how to create the missing columns in a simple manner.
  • jz451 over 4 years
    I have thought of something like that to create the necessary columns, but I was hoping for a more efficient way that doesn't rely on brute force, especially since I would need to create over 40 columns each time. There has to be some way to deal with this since most models deal with dummy encoding, but I haven't found a solution thus far.