
Tuesday, 19 March 2019

Data Wrangling Techniques: Steps involved in Data Wrangling Process

Data Wrangling is the first step we take while creating a Machine Learning model. It is the main step in which we prepare the data for a Machine Learning algorithm. This step is crucial and typically takes 60 to 80 percent of the total project time.

In Data Wrangling, we convert raw data into a format suitable as input for a Machine Learning algorithm. The terms Data Wrangling and Data Preprocessing are used interchangeably. Data Wrangling is an art: you need a lot of patience to make your data fit for a Machine Learning algorithm.

Let's see the various steps one should take while wrangling data.

1. Drop unnecessary columns

1A. Drop the columns which contain IDs, Names etc. 

For example, in the Titanic dataset, we can easily drop the Passenger Id, Passenger Name and Ticket Number columns, as they are not useful for any kind of prediction. Read more...
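As a minimal sketch (assuming the Titanic data has already been loaded into a pandas DataFrame named df, with the standard Kaggle column names), dropping these columns looks like this:

    import pandas as pd

    # Assumption: df = pd.read_csv("titanic.csv") has already been run and the
    # column names follow the standard Kaggle Titanic dataset.
    df = df.drop(columns=["PassengerId", "Name", "Ticket"])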

1B. Drop the columns which contain a lot of null or missing values

Columns in which around 75% or more of the values are missing should be dropped from the dataset. For example, in the Titanic dataset, the Cabin column contains 687 null values out of 891 observations (77% missing values), so it makes sense to drop this column. Read more...

Visualize missing values using Bar Plot
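A rough pandas sketch of this rule (df and the 75% cut-off are assumptions):

    # Fraction of missing values in each column
    missing_ratio = df.isnull().mean()

    # Drop every column where more than 75% of the values are missing
    # (e.g. the Cabin column in the Titanic dataset)
    cols_to_drop = missing_ratio[missing_ratio > 0.75].index
    df = df.drop(columns=cols_to_drop)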

1C. Drop the columns which have low variance 

You can drop a variable with zero or low variance, because such a variable will have little effect on the target variable. If all the values of a variable are approximately the same, you can safely drop it.

For example, if almost all the values of a numerical variable are 1, you can drop that variable.
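A sketch using Scikit-Learn's VarianceThreshold on the numeric columns; the 0.01 threshold is only an illustrative choice, and the numeric columns are assumed to have no missing values (VarianceThreshold does not accept NaNs):

    from sklearn.feature_selection import VarianceThreshold

    numeric_df = df.select_dtypes(include="number")

    # Keep only the columns whose variance exceeds the (illustrative) threshold
    selector = VarianceThreshold(threshold=0.01)
    selector.fit(numeric_df)

    low_variance_cols = numeric_df.columns[~selector.get_support()]
    df = df.drop(columns=low_variance_cols)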

2. Remove rows containing null values

If only around 10-15% of the observations contain null values, we can consider removing those observations. Read more...
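With pandas this is a one-liner; the subset variant below uses Titanic column names purely as an example:

    # Drop every row that contains at least one null value
    df = df.dropna()

    # Or drop rows only when specific, important columns are null
    df = df.dropna(subset=["Age", "Embarked"])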

3. Remove noise

Noise is data that is meaningless, distorted or corrupted. Noise includes invalid values, outliers and skewed values in the dataset. We need to remove this noise before supplying the dataset to an algorithm. Domain knowledge plays an important role in identifying and removing noisy data.

3A. Replace invalid values

Many times there are invalid values present in the dataset. For example, in the Pima Indian Diabetes dataset, there are zero values for Blood Pressure, Glucose, Insulin, etc., which are invalid. So, we need to replace these values with meaningful ones. Domain knowledge plays a crucial role in identifying invalid values. Read more...
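A sketch for the Pima Indian Diabetes example, assuming the usual column names of that dataset; the invalid zeros are marked as missing so they can be imputed in step 4:

    import numpy as np

    # Columns where a value of zero is physiologically impossible
    invalid_zero_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

    # Mark the invalid zeros as missing so they can be imputed later
    df[invalid_zero_cols] = df[invalid_zero_cols].replace(0, np.nan)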

3B. Remove outliers

It is very important to remove outliers from the dataset, as outliers can adversely affect the accuracy of many algorithms.

What are outliers? How to remove them?
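One common (though not the only) approach is the IQR rule: values more than 1.5 times the interquartile range beyond the quartiles are treated as outliers. A sketch for a single numeric column, using the Titanic Fare column as an assumed example:

    # IQR rule for a single numeric column
    q1 = df["Fare"].quantile(0.25)
    q3 = df["Fare"].quantile(0.75)
    iqr = q3 - q1

    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df = df[(df["Fare"] >= lower) & (df["Fare"] <= upper)]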

3C. Log Transform Skewed Variables

We should check the distribution of every variable in the dataset; if a variable is skewed, we can apply a log transformation to make its distribution approximately normal.

What is Skewness? How to visualize it with Histogram and how to remove it?
How to visualize skewness of numeric variables by plotting histograms?
Log Transforming the Skewed Data to get Normal Distribution
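A sketch of both steps, checking skewness and then applying a log transform; the Fare column is an assumed example, and log1p is used so that zero values do not break the transform (this only suits non-negative, right-skewed variables):

    import numpy as np

    # Skewness of every numeric column; values far from 0 indicate skew
    print(df.select_dtypes(include="number").skew())

    # log(1 + x) compresses the long right tail of a right-skewed variable
    df["Fare"] = np.log1p(df["Fare"])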

4. Impute missing values

Step 2 is usually not recommended, as you may lose significant data. So, it is better to try imputing the missing values with some meaningful values.

For numeric columns, you can impute the missing values with mean, median or mode. Read more...

Implementation of Imputer in Python

For categorical columns, you can impute the missing values by introducing a new category or by using the most frequent category. Read more...
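A sketch using Scikit-Learn's SimpleImputer (the current name of the imputer class); the Age and Embarked columns are assumed Titanic examples:

    from sklearn.impute import SimpleImputer

    # Numeric column: impute missing values with the median (mean also works)
    num_imputer = SimpleImputer(strategy="median")
    df[["Age"]] = num_imputer.fit_transform(df[["Age"]])

    # Categorical column: impute with the most frequent category;
    # alternatively, fill with a new category such as "Missing"
    cat_imputer = SimpleImputer(strategy="most_frequent")
    df[["Embarked"]] = cat_imputer.fit_transform(df[["Embarked"]])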

5. Transform non-numeric variables to numeric variables

There are numeric and non-numeric variables in the dataset. We need to handle these differently. 

How to separate numeric and categorical variables?
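In pandas you can split them by dtype, for example:

    numeric_cols = df.select_dtypes(include="number").columns
    categorical_cols = df.select_dtypes(include=["object", "category"]).columns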

5A. Transform categorical variables to dummy variables

To transform categorical variables into dummy variables, we can use the LabelEncoder and OneHotEncoder classes from the Scikit-Learn library and the get_dummies method from the Pandas library in Python.
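A minimal sketch of two of these approaches, using Titanic column names as assumed examples:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # LabelEncoder maps each category to an integer code (e.g. female/male -> 0/1)
    df["Sex_encoded"] = LabelEncoder().fit_transform(df["Sex"])

    # get_dummies creates one binary (dummy) column per category
    df = pd.get_dummies(df, columns=["Embarked"], drop_first=True)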

5B. Transform date variables to numeric variables

By default, dates are treated as string values. We should convert them to numeric values.

How to convert dates into numbers in the dataset?
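A sketch with pandas; order_date is a hypothetical date column, since the Titanic dataset has none:

    import pandas as pd

    # "order_date" is a hypothetical column holding dates as strings
    df["order_date"] = pd.to_datetime(df["order_date"])

    # Derive numeric features from the parsed dates
    df["order_year"] = df["order_date"].dt.year
    df["order_month"] = df["order_date"].dt.month
    df["order_day"] = df["order_date"].dt.day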

6. Feature Engineering

Feature Engineering involves Binning, Scaling (Normalization and Standardization), Dimensionality Reduction, etc. Before running most algorithms on the dataset, we need to standardize or normalize its features. Standardization and Normalization are feature scaling techniques that bring all the values onto the same scale and range; the features must already be numeric. A short scaling sketch follows the links below.

Binning Technique
Importance of Feature Scaling
Standardization vs Normalization
Implement Normalization in Python
Which algorithms require scaling and which not?
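A sketch of both scaling options with Scikit-Learn, applied to the numeric columns of an assumed DataFrame df:

    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    numeric_cols = df.select_dtypes(include="number").columns

    # Standardization: zero mean, unit variance
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

    # Normalization (alternative): rescale each feature to the [0, 1] range
    # df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])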

7. Dimensionality Reduction

Dimensionality reduction is required to remove the correlated variables and maximize the performance of the model. Basic techniques used for dimensionality reduction are:
  • PCA (Principal Component Analysis)
  • SVD (Singular Value Decomposition)
  • LDA (Linear Discriminant Analysis)
  • MDS (Multi-dimensional Scaling)
  • t-SNE (t-Distributed Stochastic Neighbor Embedding)
  • ICA (Independent Component Analysis)
Please go through my previous posts on dimensionality reduction to understand the need for this step; a PCA example follows these links.

Multicollinearity
Covariance vs Correlation
Visualize correlation score using Heatmap
Feature Selection and Feature Extraction
Need of Dimensionality Reduction
Factor Analysis
PCA, t-SNE, PCA vs t-SNE 
Implement PCA in Python
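A minimal PCA sketch with Scikit-Learn, assuming X is the scaled, numeric feature matrix produced by the previous steps:

    from sklearn.decomposition import PCA

    # Keep as many components as needed to explain 95% of the variance
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)

    print(pca.explained_variance_ratio_)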

8. Splitting the dataset into training and testing data

We should not use the entire dataset to train a model. We should keep aside around 20% of the data to test the accuracy of the model, so we usually maintain an 80:20 ratio between the training and testing datasets.
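With Scikit-Learn this is a single call; X and y are assumed to be the features and the target (e.g. Survived in the Titanic dataset):

    from sklearn.model_selection import train_test_split

    # 80% of the data for training, 20% held back for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )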
