Monday, 11 February 2019

Dimensionality Reduction: Feature Selection and Feature Extraction Techniques in Machine Learning

Whenever you get a new dataset, you don't jump straight into building a model on it. Instead, your first and most important task is to analyze the data and clean it. This task consumes most of the time in a Machine Learning project. Dimensionality Reduction is one of the most important tasks at this phase.

We will discuss various Dimensionality Reduction techniques in this article. I will not go into the details of each technique, because that would drastically increase the length of this blog post. So, I will keep it short and simple.

Dimensionality Reduction is used to reduce the number of features (variables) in a dataset without losing much information, which can also improve the performance of the model.

Dimensionality Reduction can be done in two ways:

1. Feature Selection: Remove unwanted variables

2. Feature Extraction: Extract important variables. Find a smaller set of new variables, each being a combination of the original variables, containing basically the same information as the original variables.

Feature Selection Techniques:

1. Handle variables with missing values
2. Check for variance in a variable 
3. Check for correlation between two variables
4. Random Forest
5. Backward Feature Elimination
6. Forward Feature Selection

Feature Extraction Techniques:

1. Factor Analysis
2. PCA (Principal Component Analysis)
3. SVD (Singular Value Decomposition)
4. LDA (Linear Discriminant Analysis)
5. MDS (Multidimensional Scaling)
6. t-SNE (t-Distributed Stochastic Neighbor Embedding)
7. ICA (Independent Component Analysis)

Let's elaborate on the above Dimensionality Reduction techniques:

Feature Selection Techniques:

1. Handle variables with missing values

1. If the count of missing values in a variable or a feature is greater than the threshold value, then remove the variable.

2. If there are only a few missing values in a variable or a feature, then you can do the following:

  • If it is a numerical variable, then you can replace the missing values with the mean, median or mode of the variable.
  • If it is a categorical variable, then you can replace the missing value by introducing a new category or class.
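Both rules can be sketched in a few lines of pandas. This is only a minimal illustration: the function name `handle_missing`, the 50% threshold, the choice of the median and the "Missing" category label are all assumptions made for the example, not fixed rules.

```python
import numpy as np
import pandas as pd

def handle_missing(df, drop_threshold=0.5):
    """Drop columns with too many missing values, impute the rest.

    drop_threshold is an assumed cut-off: the fraction of missing
    values above which a column is removed.
    """
    df = df.copy()
    missing_fraction = df.isna().mean()
    too_sparse = missing_fraction[missing_fraction > drop_threshold].index
    df = df.drop(columns=too_sparse)

    for col in df.columns:
        if not df[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numerical variable: fill with the median (the mean also works)
            df[col] = df[col].fillna(df[col].median())
        else:
            # Categorical variable: introduce a new category
            df[col] = df[col].fillna("Missing")
    return df

# Small made-up dataset for illustration
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "city": ["Pune", None, "Delhi", "Pune"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})
cleaned = handle_missing(df)
print(cleaned)
```

Here "mostly_empty" is dropped because 75% of its values are missing, while "age" and "city" are kept and imputed.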

2. Check for variance in a variable 

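You can drop a variable with zero or very low variance, because a variable whose values barely change carries little information for predicting the target variable. If all the values in a variable are approximately the same, then you can usually drop this variable. Keep in mind that variance depends on the scale of the variable, so compare variances only between variables that are on a comparable scale.

For example, if almost all the values in a numerical variable are 1, then you can drop this variable.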
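A rough sketch of this check with pandas is shown below. The function name `drop_low_variance` and the threshold of 0.01 are assumptions chosen for the example; a sensible threshold depends on the scale of your data.

```python
import pandas as pd

def drop_low_variance(df, threshold=0.01):
    """Drop numerical columns whose variance is at or below threshold."""
    variances = df.select_dtypes(include="number").var()
    low_variance = variances[variances <= threshold].index
    return df.drop(columns=low_variance)

# Small made-up dataset for illustration
df = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1],
    "almost_constant": [1.0, 1.0, 1.0, 1.0, 1.01],
    "informative": [3.0, 7.5, 1.2, 9.9, 4.4],
})
reduced = drop_low_variance(df)
print(list(reduced.columns))
```

Only the "informative" column survives, because the other two columns are constant or nearly constant.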

3. Check for correlation between two variables

High correlation between two variables means they have similar trends and are likely to carry similar information. This redundancy (multicollinearity) can bring down the performance of some models drastically. Linear and logistic regression models, for instance, get unstable coefficient estimates when their inputs are highly correlated.

For example, we can calculate the correlation between independent numerical variables. If the correlation coefficient crosses a certain threshold value, we can drop one of the variables (dropping a variable is highly subjective and should always be done keeping the domain in mind).

As a general guideline, we should keep those variables which show a decent or high correlation with the target variable.
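One possible way to sketch this with pandas is shown below. The function name `drop_correlated` and the 0.9 threshold are assumptions for illustration. The sketch simply drops the later column of each highly correlated pair, which is a simplification; in practice the choice should follow the domain and the correlation with the target variable.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop one variable from every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so that each pair is checked once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Small made-up dataset: height is recorded twice, in two units
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
df = pd.DataFrame({
    "height_cm": heights_cm,
    "height_in": [h / 2.54 for h in heights_cm],
    "weight_kg": [70.0, 55.0, 90.0, 60.0, 75.0],
})
reduced, dropped = drop_correlated(df)
print(dropped)
```

In this toy dataset "height_in" is a rescaled copy of "height_cm", so it is the column that gets dropped.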

4. Random Forest

Random Forest is one of the most widely used algorithms for feature selection. Once trained, it gives an importance score for every feature, which helps us select a smaller subset of features.

This topic requires broad discussion. So, I will make a separate post for it.
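Until then, here is a very small sketch of the idea using scikit-learn. The synthetic data, the feature names and the decision to keep the top two features are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the first two of five features determine the target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
feature_names = ["f0", "f1", "f2", "f3", "f4"]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Rank the features by their impurity-based importance
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_features = [name for name, _ in ranked[:2]]
print(ranked)
```

The importance scores sum to 1, so they can be read as each feature's relative share of the total importance.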

5. Backward Feature Elimination

1. Create a model with all variables (say n variables) and test its performance.

2. Remove one variable at a time, train the model with the remaining n-1 variables and test its performance. If removing a variable has no impact or the least impact on the performance of the model, you can consider dropping this variable.

3. Keep repeating this process for the remaining variables, each time deciding whether you want to retain or drop a variable.
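The steps above can be sketched as a simple loop with scikit-learn. This is just one possible reading of the procedure: the helper names, the use of linear regression, the cross-validated R² score and the tolerance of 0.01 are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_score(X, y, columns):
    """Mean 5-fold cross-validated R^2 of a linear model on the given columns."""
    model = LinearRegression()
    return cross_val_score(model, X[:, columns], y, cv=5, scoring="r2").mean()

def backward_elimination(X, y, tolerance=0.01):
    """Drop variables one by one while the score stays within tolerance of the best."""
    selected = list(range(X.shape[1]))
    best = cv_score(X, y, selected)
    while len(selected) > 1:
        # Try removing each remaining variable and keep the least harmful removal
        trials = [
            (cv_score(X, y, [c for c in selected if c != col]), col)
            for col in selected
        ]
        score, col = max(trials)
        if best - score > tolerance:
            break  # every possible removal hurts the model too much
        selected.remove(col)
        best = max(best, score)
    return selected

# Synthetic data: only columns 0 and 1 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

kept = backward_elimination(X, y)
print(kept)
```

Note that every round retrains the model once per remaining variable, so this approach gets expensive on datasets with many features.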

6. Forward Feature Selection (opposite of Backward Feature Elimination)

1. Prepare a model with one variable and test its performance.

2. Add another variable and again test its performance. If there is a significant gain in the performance of the model, then you can consider retaining this variable, otherwise you can drop it.

3. Keep repeating this process for the remaining variables, each time deciding whether you want to retain or drop a variable.
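A minimal sketch of these steps with scikit-learn could look like this. As before, the helper names, the linear regression model, the cross-validated R² score and the minimum gain of 0.01 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_score(X, y, columns):
    """Mean 5-fold cross-validated R^2 of a linear model on the given columns."""
    model = LinearRegression()
    return cross_val_score(model, X[:, columns], y, cv=5, scoring="r2").mean()

def forward_selection(X, y, min_gain=0.01):
    """Add variables one by one while each addition improves the score by min_gain."""
    remaining = list(range(X.shape[1]))
    selected = []
    best = float("-inf")
    while remaining:
        # Try adding each remaining variable and keep the most helpful one
        score, col = max((cv_score(X, y, selected + [c]), c) for c in remaining)
        if score - best < min_gain:
            break  # no remaining variable gives a significant gain
        selected.append(col)
        remaining.remove(col)
        best = score
    return selected

# Synthetic data: only columns 0 and 1 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

chosen = forward_selection(X, y)
print(chosen)
```

The variables are returned in the order in which they were added, so the first entry is the single most useful variable on its own.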

Feature Extraction Techniques:

1. Factor Analysis

2. PCA (Principal Component Analysis)

3. SVD (Singular Value Decomposition)

4. LDA (Linear Discriminant Analysis)

5. MDS (Multidimensional Scaling)

6. t-SNE (t-Distributed Stochastic Neighbor Embedding)

7. ICA (Independent Component Analysis)

The above list requires detailed elaboration. So, I will discuss all of them in my future posts.
