Consider an Employee dataset with features such as employee age and employee salary. The AGE feature takes values in the range 25-60, while SALARY takes values in the range 10000-100000. Because the two features differ so much in scale, the larger-valued feature can dominate and adversely affect the performance of many Machine Learning algorithms, especially those that rely on distances or gradient descent. So, we need to standardize these features to bring them onto a common scale before building models.
The idea behind Standardization is to transform each feature so that its distribution has a mean of 0 and a standard deviation of 1.
MEAN = 0
STANDARD DEVIATION = 1
For each feature, that feature's mean is subtracted from every value, and the result is then divided by that feature's standard deviation: z = (x - mean) / standard deviation.
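To make the formula concrete, here is a minimal sketch that standardizes a small array by hand with NumPy. The four age/salary rows are made-up values for illustration only, and the variable names are assumptions, not part of any library. It uses the population standard deviation (NumPy's default, `ddof=0`), which is also what scikit-learn's StandardScaler uses.

```python
import numpy as np

# Hypothetical toy data for illustration: columns are [age, salary]
X = np.array([
    [25, 10000],
    [35, 40000],
    [45, 70000],
    [60, 100000],
], dtype=float)

# Statistics are computed per feature (per column), not over the whole array
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)

# z = (x - mean) / standard deviation, applied column by column via broadcasting
X_standardized = (X - mean) / std

print(X_standardized.mean(axis=0))  # approximately 0 for each column
print(X_standardized.std(axis=0))   # approximately 1 for each column
```

After the transformation both columns are on the same scale, even though the raw salary values were several orders of magnitude larger than the ages.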
How to implement Standardization using the scikit-learn library in Python?
StandardScaler performs the task of Standardization.
First of all, you need to import StandardScaler:
from sklearn.preprocessing import StandardScaler
Assuming you have split your dataset like this:
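from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
Transform your X_train and X_test features as shown below. The scaler is fitted on the training data only, and the mean and standard deviation learned there are then reused to transform the test data, so no information from the test set leaks into preprocessing: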
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
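Putting the steps together, the following is a self-contained sketch of the whole workflow. The Employee data here is randomly generated just so the example runs; the feature ranges follow the ones described above, while the labels `y` and the `_scaled` variable names are placeholders assumed for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Employee dataset (assumed values, not real data)
rng = np.random.default_rng(0)
n = 200
age = rng.integers(25, 61, size=n)             # ages in 25-60
salary = rng.integers(10000, 100001, size=n)   # salaries in 10000-100000
X = np.column_stack([age, salary]).astype(float)
y = rng.integers(0, 2, size=n)                 # placeholder binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train, then scale
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

print(scaler.mean_)    # per-feature means learned from X_train
print(scaler.scale_)   # per-feature standard deviations learned from X_train
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))
```

The scaled training features have mean 0 and standard deviation 1 up to floating-point error. The scaled test features will usually be close to, but not exactly, 0 and 1, because they are transformed with the training statistics rather than their own.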