Adding scikit-learn (sklearn) prediction to pandas data frame
Your second approach, df_total["pred_lin_regr"] = clf.predict(Xtest), is correct, and it's more efficient.
There you're taking the output of clf.predict(), which is a NumPy array, and adding it to a dataframe as a new column. The array is already ordered to match the rows of Xtest, so adding it to the dataframe will not change or alter that order. This does rely on df_total having the same rows in the same order as Xtest, which holds here because df_total was built from Xtest.
Here's a little proof from this example:
Take the following portion:
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
print(regr.predict(diabetes_X_test))
df = pd.DataFrame(regr.predict(diabetes_X_test))
print(df)
The first print() function will give us a numpy array as expected:
[ 225.9732401 115.74763374 163.27610621 114.73638965 120.80385422
158.21988574 236.08568105 121.81509832 99.56772822 123.83758651
204.73711411 96.53399594 154.17490936 130.91629517 83.3878227
171.36605897 137.99500384 137.99500384 189.56845268 84.3990668 ]
That order is identical to the second print() function, in which we add the results to a dataframe:
0
0 225.973240
1 115.747634
2 163.276106
3 114.736390
4 120.803854
5 158.219886
6 236.085681
7 121.815098
8 99.567728
9 123.837587
10 204.737114
11 96.533996
12 154.174909
13 130.916295
14 83.387823
15 171.366059
16 137.995004
17 137.995004
18 189.568453
19 84.399067
Rerunning the code on a portion of the test set gives the same results in the same order:
print(regr.predict(diabetes_X_test[0:5]))
df = pd.DataFrame(regr.predict(diabetes_X_test[0:5]))
print(df)
[ 225.9732401 115.74763374 163.27610621 114.73638965 120.80385422]
0
0 225.973240
1 115.747634
2 163.276106
3 114.736390
4 120.803854
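The same point can be checked with a labelled index like the one in the question. The following is a minimal sketch, not the asker's actual data: the frame X, its columns a and b, and the random values are made up for illustration. It shows that assigning the array directly is positional, that wrapping it in a Series with index=X.index keeps the event labels for pd.concat, and that a Series with the default 0..n-1 index does not line up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the question's data: rows labelled 8000..8004
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(5, 2)), columns=["a", "b"],
                 index=range(8000, 8005))
y = pd.Series([1, 0, 0, 0, 0], index=X.index, name="y")

clf = LinearRegression().fit(X, y)
pred = clf.predict(X)  # plain ndarray, one value per row of X, in row order

# Option 1: assign the array directly; pandas matches it to rows by position
df_total = pd.concat([X, y], axis=1)
df_total["pred_lin_regr"] = pred

# Option 2: give the predictions the same labels as X, then concatenate
ypred = pd.Series(pred, index=X.index, name="pred_lin_regr")
df_concat = pd.concat([X, y, ypred], axis=1)

# Pitfall: a Series with the default index (0..4) shares no labels with X,
# so concat produces extra rows filled with NaN instead of matching events
bad = pd.concat([X, pd.Series(pred, name="pred_lin_regr")], axis=1)

print(df_total)
print(bad)
```

Both options end up with the same column; the second is only needed if you want to go through pd.concat or another label-aligned operation.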
bolla
Updated on November 16, 2020
Comments
-
bolla over 3 years
I am trying to add a sklearn prediction to a pandas dataframe, so that I can make a thorough evaluation of the prediction. The relevant piece of code is the following:
clf = linear_model.LinearRegression()
clf.fit(Xtrain,ytrain)
ypred = pd.DataFrame({'pred_lin_regr': pd.Series(clf.predict(Xtest))})
The dataframes look like this:
Xtest
      axial_MET  cos_theta_r1  deltaE_abs   lep1_eta   lep1_pT   lep2_eta
8000   1.383026      0.332365    1.061852   0.184027  0.621598  -0.316297
8001  -1.054412      0.046317    1.461788  -1.141486  0.488133   1.011445
8002   0.259077      0.429920    0.769219   0.631206  0.353469   1.027781
8003  -0.096647      0.066200    0.411222  -0.867441  0.856115  -1.357888
8004   0.145412      0.371409    1.111035   1.374081  0.485231   0.900024
ytest
8000    1
8001    0
8002    0
8003    0
8004    0
ypred
   pred_lin_regr
0       0.461636
1       0.314448
2       0.363751
3       0.291858
4       0.416056
Concatenating Xtest and ytest works fine:
df_total = pd.concat([Xtest, ytest], axis=1)
but the event information is lost on ypred.
What would be the most python/pandas/numpy-like way to do this?
I am using the following versions:
argparse==1.2.1
cycler==0.9.0
decorator==4.0.4
ipython==4.0.0
ipython-genutils==0.1.0
matplotlib==1.5.0
nose==1.3.7
numpy==1.10.1
pandas==0.17.0
path.py==8.1.2
pexpect==4.0.1
pickleshare==0.5
ptyprocess==0.5
py==1.4.30
pyparsing==2.0.5
pytest==2.8.2
python-dateutil==2.4.2
pytz==2015.7
scikit-learn==0.16.1
scipy==0.16.1
simplegeneric==0.8.1
six==1.10.0
sklearn==0.0
traitlets==4.0.0
wsgiref==0.1.2
I tried the following:
df_total["pred_lin_regr"] = clf.predict(Xtest)
This seems to do the job, but I'm not sure that the events are matched correctly.
-
asmgx almost 5 years
How can the same problem be fixed if Xtest was a random selection and the indexes were shuffled? In that case we cannot match each record across the two dataframes.
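One possible way to handle a shuffled test set is to reuse the test frame's own index for the predictions. The sketch below rests on assumptions, not on the asker's setup: the data is synthetic, the column names f0, f1, f2 and target are invented, and train_test_split is assumed as the source of the random selection. Because clf.predict(Xtest) returns values in the row order of Xtest, labelling them with Xtest.index ties each prediction to its original record no matter how the rows were shuffled.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(20, 3)), columns=["f0", "f1", "f2"])
y = pd.Series(X["f0"] * 2.0 + rng.normal(scale=0.1, size=20), name="target")

# Random split: Xtest keeps the original row labels, in shuffled order
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LinearRegression().fit(Xtrain, ytrain)

# Predictions come back in the row order of Xtest, so reuse its index
ypred = pd.Series(clf.predict(Xtest), index=Xtest.index, name="pred_lin_regr")

# Label-aligned concat of the test rows with their predictions
df_test = pd.concat([Xtest, ytest, ypred], axis=1)

# Writing back to the full dataset also aligns on labels;
# training rows get NaN because they have no prediction
df_all = pd.concat([X, y], axis=1)
df_all["pred_lin_regr"] = ypred

print(df_test)
```

This only works if the index labels are unique. If they are not, resetting the index before splitting is one way to get unambiguous labels.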