
Thursday, 21 March 2019

Feature Scaling Techniques: Difference between Normalization and Standardization

Normalization and Standardization are both feature scaling techniques that help when variables have different units and scales. Feature scaling is an important step in data preprocessing and data wrangling.

For example, consider an Employee dataset with features such as Employee Age and Employee Salary. The Age feature takes values in the range 22-60, while the Salary feature takes values in the range 10000-100000. Because these two features differ so much in scale, they need to be normalized or standardized to a common scale before building a Machine Learning model. Some algorithms handle scaling internally, but for others you must do it yourself.

Here, the Salary feature dominates the Age feature simply because its values are numerically much larger. If we don't want one variable to dominate the others, we use either Normalization or Standardization.
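One way to see this dominance is to compute a Euclidean distance between employees, as a distance-based model such as k-nearest neighbours would. The sketch below is only an illustration: the three employees and their values are made up, and the choice of Euclidean distance is an assumption, not something tied to a particular algorithm.

```python
import numpy as np

# Hypothetical employees as [age, salary]; values are invented for illustration.
a = np.array([25.0, 50000.0])
b = np.array([55.0, 50500.0])  # 30 years older than a, salary differs by 500
c = np.array([26.0, 52000.0])  # 1 year older than a, salary differs by 2000

# Euclidean distance on the raw, unscaled features
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)

# On raw values, a looks closer to b than to c, because the salary
# difference swamps the 30-year age gap.
print(dist_ab, dist_ac)
```

On the unscaled data the age gap barely contributes to the distance, which is the effect that scaling is meant to remove.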

Disadvantage of Feature Scaling: After standardization or normalization, both Age and Salary are on the same scale, but the original values are replaced by transformed ones. The values in each feature therefore become harder to interpret, but in return the model treats the features consistently.

Normalization

Normalization scales the values of a feature into a range of [0,1].

Xnew = (X - Xmin) / (Xmax - Xmin)
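The formula can be sketched in a few lines of NumPy. The function name min_max_normalize and the sample ages are hypothetical, and returning zeros for a constant feature is just one possible convention for avoiding a division by zero.

```python
import numpy as np

def min_max_normalize(x):
    """Scale a 1-D array into [0, 1] using (X - Xmin) / (Xmax - Xmin)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # Constant feature: there is no spread to scale, so return zeros
        # (an assumed convention) instead of dividing by zero.
        return np.zeros_like(x)
    return (x - x_min) / (x_max - x_min)

ages = [22, 30, 41, 60]  # made-up values within the 22-60 range
normalized_ages = min_max_normalize(ages)
print(normalized_ages)
```

The smallest age maps to 0, the largest to 1, and everything else falls in between in proportion to its position in the original range.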

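A disadvantage of normalization compared with standardization is that it is sensitive to outliers: a single extreme value stretches the range, so the remaining values are squeezed into a narrow part of [0,1] and the differences between them become hard to see.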
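A small experiment makes the effect of an outlier on min-max scaling concrete. The salaries below are invented, with one deliberately extreme value added at the end.

```python
import numpy as np

# Made-up salaries; the last value is an intentional outlier.
salaries = np.array([10000.0, 20000.0, 30000.0, 40000.0, 1000000.0])

# Min-max normalization: (X - Xmin) / (Xmax - Xmin)
scaled = (salaries - salaries.min()) / (salaries.max() - salaries.min())

# The outlier maps to 1, while the four ordinary salaries are
# squeezed into a small interval close to 0.
print(scaled)
```

Because the outlier defines Xmax, the ordinary salaries end up almost indistinguishable from one another after scaling.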

Normalization is useful when we are reasonably sure that there are no anomalies (i.e. outliers) with extremely large or small values. For example, in a recommendation system, the ratings made by users are limited to a small finite set like {1, 2, 3, 4, 5}.

Standardization 

Standardization refers to subtracting the mean (μ) of a feature from each value and then dividing by the feature's standard deviation (σ). Standardization transforms your data such that the resulting distribution has a mean of 0 and a standard deviation of 1.

Xnew = (X - μ) / σ
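A minimal NumPy sketch of this formula follows. The function name standardize and the sample salaries are hypothetical, and the code assumes the population standard deviation (NumPy's default, ddof=0); using the sample standard deviation instead would give slightly different values.

```python
import numpy as np

def standardize(x):
    """Return (X - mean) / std for a 1-D array."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std()  # population standard deviation (ddof=0), an assumption
    if sigma == 0:
        # Constant feature: return zeros (an assumed convention)
        # instead of dividing by zero.
        return np.zeros_like(x)
    return (x - mu) / sigma

salaries = [10000, 25000, 40000, 100000]  # made-up values
z = standardize(salaries)

# The standardized values have mean 0 and standard deviation 1
# (up to floating-point rounding).
print(z.mean(), z.std())
```

Unlike min-max normalization, the result is not confined to a fixed range: values far from the mean simply get large positive or negative scores.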

For most applications, standardization is recommended over normalization. For more details on Standardization, please go through my post on the topic.
