How to use sklearn fit_transform with pandas and return dataframe instead of numpy array?
Solution 1
You can convert the DataFrame to a numpy array using .values. Example on a random dataset:
Edit: changed as_matrix() to values (it doesn't change the result), per the last sentence of the as_matrix() docs:
Generally, it is recommended to use ‘.values’.
import pandas as pd
import numpy as np #for the random integer example
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)),
                  index=range(10, 20),
                  columns=['col1', 'col2', 'col3', 'col4'],
                  dtype='float64')
Note, indices are 10-19:
In [14]: df.head(3)
Out[14]:
col1 col2 col3 col4
10 3 38 86 65
11 98 3 66 68
12 88 46 35 68
Now fit_transform the DataFrame to get the scaled_features array:
from sklearn.preprocessing import StandardScaler
scaled_features = StandardScaler().fit_transform(df.values)
In [15]: scaled_features[:3,:] #lost the indices
Out[15]:
array([[-1.89007341, 0.05636005, 1.74514417, 0.46669562],
[ 1.26558518, -1.35264122, 0.82178747, 0.59282958],
[ 0.93341059, 0.37841748, -0.60941542, 0.59282958]])
Assign the scaled data to a DataFrame (note: use the index and columns keyword arguments to keep your original indices and column names):
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
In [17]: scaled_features_df.head(3)
Out[17]:
col1 col2 col3 col4
10 -1.890073 0.056360 1.745144 0.466696
11 1.265585 -1.352641 0.821787 0.592830
12 0.933411 0.378417 -0.609415 0.592830
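As a quick sanity check, the round trip above can be verified: each column of the result should have (approximately) zero mean and unit variance, and the original index should be preserved. A minimal, self-contained sketch of the steps just shown:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same setup as above: a random frame with a non-default index 10..19
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)),
                  index=range(10, 20),
                  columns=['col1', 'col2', 'col3', 'col4'],
                  dtype='float64')

scaled = pd.DataFrame(StandardScaler().fit_transform(df.values),
                      index=df.index, columns=df.columns)

# StandardScaler centers each column to mean 0 and scales to unit variance
print(np.allclose(scaled.mean(), 0.0))  # True
print(scaled.index.equals(df.index))    # True -- original index preserved
```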
Edit 2:
Came across the sklearn-pandas package. It's focused on making scikit-learn easier to use with pandas. sklearn-pandas is especially useful when you need to apply more than one type of transformation to column subsets of the DataFrame, a common scenario. It's documented, but this is how you'd achieve the transformation we just performed:
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([(df.columns, StandardScaler())])
scaled_features = mapper.fit_transform(df.copy())
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
Solution 2
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('your file here')
ss = StandardScaler()
df_scaled = pd.DataFrame(ss.fit_transform(df), index=df.index, columns=df.columns)
df_scaled will be the 'same' dataframe, only now with the scaled values. Note the index=df.index argument, which keeps the original index; without it the rows are renumbered from 0.
Solution 3
Reassigning back to df.values preserves both index and columns.
df.values[:] = StandardScaler().fit_transform(df)
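A sketch illustrating the in-place variant: the scaled array is written into the DataFrame's existing buffer, so the index, the columns, and the object itself are all unchanged. Note this relies on df.values being a writable view of a single homogeneous (all-float) block, which, as the comments below suggest, may not hold in every pandas version:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(np.random.rand(10, 4),
                  index=range(10, 20),
                  columns=['col1', 'col2', 'col3', 'col4'])
original_index = df.index.copy()

# Write the scaled values into the frame's own array -- no new object is
# created, so index and columns are trivially preserved
df.values[:] = StandardScaler().fit_transform(df)

print(df.index.equals(original_index))  # True
print(list(df.columns))                 # ['col1', 'col2', 'col3', 'col4']
```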
Solution 4
features = ["col1", "col2", "col3", "col4"]
autoscaler = StandardScaler()
df[features] = autoscaler.fit_transform(df[features])
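A sketch of this subset-assignment approach on a mixed frame: only the listed feature columns are scaled, and non-numeric columns survive untouched. (Per the comments below, if df is itself a slice of another DataFrame, recent pandas may raise a SettingWithCopy error; use df.loc[:, features] = ... or work on a .copy() in that case.)

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'col1': np.random.rand(5),
    'col2': np.random.rand(5),
    'label': list('abcde'),   # non-numeric column, left as-is
}, index=range(10, 15))

features = ['col1', 'col2']
autoscaler = StandardScaler()
# Assign the scaled array back into the column subset: the DataFrame
# keeps its index, its column order, and the 'label' column unchanged
df[features] = autoscaler.fit_transform(df[features])

print(df['label'].tolist())  # ['a', 'b', 'c', 'd', 'e']
```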
Solution 5
This worked with MinMaxScaler to get the array values back into the original dataframe. It should work with StandardScaler as well.
data_scaled = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
where data_scaled is the new dataframe, scaled_features is the array after normalization, and df is the original dataframe whose index and columns we need back.
Updated on March 21, 2022
Comments
-
Louic about 2 years
I want to apply scaling (using StandardScaler() from sklearn.preprocessing) to a pandas dataframe. The following code returns a numpy array, so I lose all the column names and indices. This is not what I want.
features = df[["col1", "col2", "col3", "col4"]]
autoscaler = StandardScaler()
features = autoscaler.fit_transform(features)
A "solution" I found online is:
features = features.apply(lambda x: autoscaler.fit_transform(x))
It appears to work, but leads to a deprecationwarning:
/usr/lib/python3.5/site-packages/sklearn/preprocessing/data.py:583: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
I therefore tried:
features = features.apply(lambda x: autoscaler.fit_transform(x.reshape(-1, 1)))
But this gives:
Traceback (most recent call last):
  File "./analyse.py", line 91, in <module>
    features = features.apply(lambda x: autoscaler.fit_transform(x.reshape(-1, 1)))
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 3972, in apply
    return self._apply_standard(f, axis, reduce=reduce)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 4081, in _apply_standard
    result = self._constructor(data=results, index=index)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 226, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 363, in _init_dict
    dtype=dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 5163, in _arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 5477, in _homogenize
    raise_cast_failure=False)
  File "/usr/lib/python3.5/site-packages/pandas/core/series.py", line 2885, in _sanitize_array
    raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional
How do I apply scaling to the pandas dataframe, leaving the dataframe intact? Without copying the data if possible.
-
Louic about 8 yearsThank you for the answer, but the problem still is that the rows are renumbered when the new dataframe is created from the array. The original dataframe does not contain consecutively numbered rows because some of them have been removed. I suppose I could also add an index=[...] keyword with the old index values. If you update your answer accordingly I can accept it.
-
Kevin about 8 yearsI hope the edit helps; I think your intuition about setting the index values from the first df was correct. The numbers I used are consecutive (just wanted to show you can reset them to anything; range(10,20) was the best I could think of), but it will work with any random index on the original df. HTH!
-
WestCoastProjects over 6 yearsI see that you have the last step as converting the output of the DataFrameMapper to a DataFrame... so the output is not already a DataFrame?
leokury about 5 yearsBut this does not maintain data types
-
Louic over 4 yearsThanks for your answer, but the solutions given as accepted answer are much better. Also, it can be done with dask-ml:
from dask_ml.preprocessing import StandardScaler; StandardScaler().fit_transform(df)
-
Piotr Labunski about 4 yearsWhile this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value.
-
gosuto almost 4 yearsWon't all data types become floats anyway since that is the only output of the scaler? What other outputs do you expect from it? @leokury
-
Nerxis over 3 years@StephenBoesch: Yes, the output is not a DataFrame. If you want to get it directly from the mapper, you have to use the df_out=True option for DataFrameMapper.
Krishna Chaurasia over 3 yearsPlease consider adding explanation to the code for easier understanding.
-
Vega about 3 yearsThis now throws a: "SettingWithCopyError: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"
-
jajamaharaja over 2 years@Vega how do you deal with this?
-
Patricia over 2 yearsIn current versions, you must add the parameter index=df.index in order to keep the index from the original data frame.
user4933 over 2 yearsUnderrated answer :D
-
Hindol over 2 yearsDid not work for me in the latest version of pandas.
-
Quinten Cabo about 2 yearsThis is the reason I came here but I have not found an answer yet. I asked this new question about it: stackoverflow.com/questions/72232036/…
-
Jim almost 2 yearsI just tried it with pandas 1.4.2, (released 2 April 2022) and it works there.