More efficient way to mean center a sub-set of columns in a pandas dataframe and retain column names

python pandas machine-learning scikit-learn statsmodels

13,402

Solution 1

df.apply(lambda x: x-x.mean())

%timeit df.apply(lambda x: x-x.mean())
1000 loops, best of 3: 2.09 ms per loop

df.subtract(df.mean())

%timeit df.subtract(df.mean())
1000 loops, best of 3: 902 µs per loop

both yielding:

        i  we  you  shehe  they  ipron
0  0.0725   0    0  0.195 -0.18 -0.145
1  0.8025   0    0 -0.065 -0.18  0.495
2 -0.4375   0    0 -0.065  0.54  0.285
3 -0.4375   0    0 -0.065 -0.18 -0.635

Solution 2

I know this question is a little old, but by now Scikit is the fastest solution. Plus, you can condense the code in one line:

pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False),columns = df.columns)

%timeit pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False),columns = df.columns)
684 µs ± 30.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


test.subtract(df.mean())

%timeit df.subtract(df.mean())
1.63 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

df I used for testing:

df = pd.DataFrame(np.random.randint(low=1, high=10, size=(20,5)),columns = list('abcde'))

13,402

Author by

R_Queery

Behavioral statistician taking a foray into text mining, topic modeling, and machine learning.

Updated on June 11, 2022

Comments

R_Queery almost 2 years
I have a dataframe that has about 370 columns. I'm testing a series of hypothesis that require me to use subsets of the model to fit a cubic regression model. I'm planning on using statsmodels to model this data.

Part of the process for polynomial regression involves mean centering variables (subtracting the mean from every case for a particular feature).

I can do this with 3 lines of code but it seems inefficient, given that I need to replicate this process for half a dozen hypothesis. Keep in mind that I need to data at the coefficient level from the statsmodel output so I need to retain the column names.

Here's a peek at the data. It's the sub-set of columns I need for one of my hypothesis tests.
```
      i  we  you  shehe  they  ipron
0  0.51   0    0   0.26  0.00   1.02
1  1.24   0    0   0.00  0.00   1.66
2  0.00   0    0   0.00  0.72   1.45
3  0.00   0    0   0.00  0.00   0.53
```
Here is the code that mean centers and keeps the column names.
```
from sklearn import preprocessing
#create df of features for hypothesis, from full dataframe
h2 = df[['i', 'we', 'you', 'shehe', 'they', 'ipron']]

#center the variables
x_centered = preprocessing.scale(h2, with_mean='True', with_std='False')

#convert back into a Pandas dataframe and add column names
x_centered_df = pd.DataFrame(x_centered, columns=h2.columns)
```
Any recommendations on how to make this more efficient / faster would be awesome!