Sunday, 17 February 2019

Why standardize and transform the features in a dataset before applying Machine Learning algorithms?

Before applying Machine Learning algorithms to a dataset, we need to standardize and transform its features. Why is this required? Let's try to understand with an example:

Consider an Employee dataset that contains features such as Employee Age and Employee Salary. The AGE feature takes values in the range 25-60, while the SALARY feature takes values in the range 10000-100000. Because the two features differ so much in scale, the feature with the larger values (SALARY) can dominate the other, which can adversely affect the performance of many algorithms. So, we need to standardize these features to bring them onto a common scale before building Machine Learning models.
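As a rough illustration of why the scale gap matters, assume an algorithm that compares rows using Euclidean distance (this choice of distance and the two employees below are assumptions made for the example, not part of the dataset described above). The sketch shows how much of the squared distance comes from SALARY alone:

```python
import math

# Two hypothetical employees as (age, salary) pairs; the values are made up.
employee_a = (25, 50000)
employee_b = (60, 51000)

# Per-feature differences on the original, unscaled values.
age_gap = employee_a[0] - employee_b[0]
salary_gap = employee_a[1] - employee_b[1]

# Euclidean distance between the two rows.
squared_total = age_gap ** 2 + salary_gap ** 2
distance = math.sqrt(squared_total)

# Fraction of the squared distance that comes from the salary feature.
salary_share = salary_gap ** 2 / squared_total

print(f"distance = {distance:.2f}")
print(f"share of squared distance due to salary = {salary_share:.4f}")
```

Even though the two employees are 35 years apart in age and their salaries differ by only about 2%, almost all of the distance comes from SALARY, simply because its values are larger.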

The idea behind Standardization is to transform your data so that each feature has a mean of 0 and a standard deviation of 1.

MEAN = 0
STANDARD DEVIATION = 1

This is done one feature at a time: the mean of the feature is subtracted from each of its values, and the result is then divided by the standard deviation of that feature.
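In other words, each value x becomes z = (x - mean) / standard deviation. A minimal sketch of this calculation for a single feature, using only the Python standard library, could look like the following. The age values are made up for illustration, and the population standard deviation is used because that is what StandardScaler uses as well:

```python
import statistics

# Hypothetical AGE values; any numeric feature column would work the same way.
ages = [25, 30, 45, 60]

mean = statistics.mean(ages)
std = statistics.pstdev(ages)  # population standard deviation (divides by n)

# Subtract the mean from each value, then divide by the standard deviation.
z_scores = [(age - mean) / std for age in ages]

print(z_scores)
print(statistics.mean(z_scores))    # approximately 0
print(statistics.pstdev(z_scores))  # approximately 1
```

Values below the mean become negative, values above it become positive, and the result no longer depends on the unit the feature was originally measured in.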

How to implement Standardization using the scikit-learn library in Python?

StandardScaler performs the task of Standardization.

First of all, you need to import StandardScaler:

from sklearn.preprocessing import StandardScaler 

Assuming you have split your dataset like this:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

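Fit the scaler on X_train only, then use it to transform both X_train and X_test. This way, no information from the test set is used when computing the mean and standard deviation:

# Learn the mean and standard deviation of each feature from the training data only
scaler = StandardScaler()
scaler.fit(X_train)
# Apply the same training statistics to both sets
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)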
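To see the steps working end to end, here is a small self-contained sketch. The data is randomly generated to resemble the AGE and SALARY ranges from the example, and the labels in y are placeholders. The random seeds and the variable names X_train_scaled and X_test_scaled are assumptions made for this illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Employee dataset (made-up values).
rng = np.random.default_rng(0)
n_rows = 200
age = rng.integers(25, 61, size=n_rows)            # 25 to 60 inclusive
salary = rng.integers(10000, 100001, size=n_rows)  # 10000 to 100000 inclusive
X = np.column_stack([age, salary]).astype(float)
y = rng.integers(0, 2, size=n_rows)                # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# fit_transform is shorthand for fit followed by transform on the same data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("train mean per feature:", X_train_scaled.mean(axis=0))
print("train std per feature: ", X_train_scaled.std(axis=0))
print("test mean per feature: ", X_test_scaled.mean(axis=0))
```

After scaling, the training features have a mean of approximately 0 and a standard deviation of approximately 1. The test features will usually be close to these values but not exactly equal, because they are scaled with the statistics learned from the training set.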