Wednesday, 20 March 2019

What is Multicollinearity? What is Structural and Data Multicollinearity?

Multicollinearity is a situation in which two or more predictors (independent variables) in a model are highly correlated with one another.

For example, suppose you have two explanatory variables – ‘time spent on treadmill in minutes’ and ‘calories burnt’. These variables are highly correlated: the more time you spend running on a treadmill, the more calories you burn. Because each carries nearly the same information, there is little point in keeping both. As a common rule of thumb, if the absolute correlation between two independent variables is high (>= 0.8), one of them is dropped; otherwise the model may suffer from multicollinearity.
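The rule of thumb above can be sketched in a few lines of plain Python. This is only an illustration: the column names, the numbers, and the helper functions `pearson` and `drop_correlated` are made up for this example, and the 0.8 cutoff is the rule-of-thumb value from the text, not a universal constant.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.8):
    """Keep a column only if its absolute correlation with every
    already-kept column is below the threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Made-up data for illustration only.
data = {
    "treadmill_minutes": [10, 20, 30, 40, 50, 60],
    "calories_burnt": [95, 210, 290, 405, 510, 590],
    "resting_heart_rate": [72, 60, 75, 58, 70, 64],
}

kept = drop_correlated(data)
print(kept)
```

Which variable of a correlated pair gets dropped here depends only on column order; in practice you would usually keep the one that is easier to measure or interpret.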

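If the degree of multicollinearity between the independent variables is high enough, it can cause problems when you fit the model and interpret the results: the coefficient estimates become unstable, and it is hard to attribute an effect to any one of the correlated variables.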

Collinearity is a linear association between two explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between them.

Types of Multicollinearity

1. Structural Multicollinearity: This type of multicollinearity occurs when we create a new variable from another variable while building the model. For example, suppose there is a variable "x" and you create another variable "y" from it, where y=cx (c is any nonzero constant). In this case, "x" and "y" are perfectly correlated.

2. Data Multicollinearity: This type of multicollinearity is present in the data itself rather than being introduced while building the model. So, we need to identify it during the data wrangling process.
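The structural case can be checked numerically. The sketch below uses made-up values for "x" and an arbitrary constant c, and a hand-written `pearson` helper (not part of any library), to show that a variable derived as y=cx has a correlation of exactly +1 or -1 with "x", depending on the sign of c.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative values only; any non-constant x and nonzero c behave the same way.
x = [1.0, 2.0, 3.0, 4.0, 5.0]

y_pos = [3.0 * v for v in x]    # y = cx with c = 3
y_neg = [-2.0 * v for v in x]   # y = cx with c = -2

r_pos = pearson(x, y_pos)
r_neg = pearson(x, y_neg)
print(r_pos, r_neg)
```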

How to remove correlated variables?

The following techniques are used to handle the multicollinearity problem in a dataset:

1. PCA (Principal Component Analysis)
2. SVD (Singular Value Decomposition)
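The two techniques are closely related: one common way to compute PCA is to apply SVD to the centered (here also standardized) data. The following is a minimal sketch assuming NumPy is available; the treadmill data is synthetic and the variable names are invented for this example. It replaces two highly correlated columns with principal components that are uncorrelated with each other.

```python
import numpy as np

# Synthetic data for illustration: calories depend almost linearly on minutes.
rng = np.random.default_rng(0)
minutes = rng.uniform(10, 60, size=200)
calories = 10.0 * minutes + rng.normal(0, 5, size=200)
X = np.column_stack([minutes, calories])

# Standardize each column (zero mean, unit variance) before PCA.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD of the standardized data: rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Share of total variance captured by each component.
explained = S**2 / np.sum(S**2)

# Project the data onto the principal directions.
scores = Z @ Vt.T

corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(scores, rowvar=False)[0, 1]

print("correlation of original columns:", corr_before)
print("correlation of principal components:", corr_after)
print("explained variance ratio:", explained)
```

Because the first component captures almost all of the variance in this example, keeping only `scores[:, 0]` would give a single predictor in place of the two correlated ones, at the cost of a less directly interpretable variable.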
