Data Wrangling is the first step in building a Machine Learning model. It is the step in which we prepare the data for a Machine Learning algorithm, and it is crucial: it typically consumes 60 to 80 percent of the total project time.
In Data Wrangling, we convert raw data into a format that can be fed to a Machine Learning algorithm. The terms Data Wrangling and Data Preprocessing are used interchangeably. Data Wrangling is an art, and it takes a lot of patience to make your data fit for a Machine Learning algorithm.
Let's look at the various steps one should take while Data Wrangling.
1. Drop unnecessary columns
1A. Drop the columns which contain IDs, Names etc.
For example, in the Titanic dataset, we can safely drop the Passenger Id, Passenger Name and Ticket Number columns, since they are not useful for any kind of prediction. Read more...
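As a minimal sketch (assuming the Titanic data sits in a local file called titanic.csv; adjust the path and column names to your copy), this can be done in Pandas like so:

```python
import pandas as pd

# Load the Titanic data; the file name/path is an assumption
df = pd.read_csv("titanic.csv")

# Drop identifier-like columns that carry no predictive signal
df = df.drop(columns=["PassengerId", "Name", "Ticket"])
print(df.columns.tolist())
```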
1B. Drop the columns which contain a lot of null or missing values
Columns in which roughly 75% or more of the values are missing should be dropped from the dataset. For example, in the Titanic dataset, the Cabin column contains 687 null values out of 891 observations (77% missing), so it makes sense to drop this column. Read more...
Visualize missing values using Bar Plot
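Both steps, checking the share of missing values per column and visualizing the counts with a bar plot, can be sketched in Pandas as follows (the titanic.csv path and the 75% cut-off are assumptions for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("titanic.csv")  # file path is an assumption

# Fraction of missing values in each column
missing_ratio = df.isnull().mean()

# Bar plot of missing-value counts per column
df.isnull().sum().plot(kind="bar")
plt.ylabel("Number of missing values")
plt.tight_layout()
plt.show()

# Drop columns where more than ~75% of the values are missing (e.g. Cabin)
df = df.drop(columns=missing_ratio[missing_ratio > 0.75].index)
```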
1C. Drop the columns which have low variance
You can drop a variable with zero or near-zero variance, because such a variable carries almost no information about the target. If all the values in a variable are approximately the same, you can safely drop it.
For example, if almost all the values in a numerical variable are 1, you can drop that variable.
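A quick sketch of spotting and dropping near-constant numeric columns (the 0.01 variance threshold is purely illustrative, not a fixed rule):

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # file path is an assumption

# Variance of each numeric column; the threshold below is only illustrative
variances = df.select_dtypes(include="number").var()
low_variance_cols = variances[variances < 0.01].index

df = df.drop(columns=low_variance_cols)
```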
2. Remove rows containing null values
If only around 10-15% of the observations contain null values, we can consider removing those observations. Read more...
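A minimal sketch with Pandas (again assuming a local titanic.csv):

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # file path is an assumption

rows_before = len(df)
df = df.dropna()  # drop every row that contains at least one null value
print(f"Dropped {rows_before - len(df)} of {rows_before} rows")
```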
3. Remove noise
Noise is data that is meaningless, distorted or corrupted. It includes invalid values, outliers and heavily skewed values in the dataset. We need to remove this noise before supplying the dataset to an algorithm. Domain knowledge plays an important role in identifying and removing noisy data.
3A. Replace invalid values
Datasets often contain invalid values. For example, in the Pima Indian Diabetes dataset, there are zero values for Blood Pressure, Glucose, Insulin, etc., which are physically impossible. We need to replace these values with meaningful ones, and domain knowledge plays a crucial role in identifying them. Read more...
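A small sketch of this idea on the Pima data (the file name diabetes.csv and the exact column names are assumptions; median imputation is just one reasonable choice):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("diabetes.csv")  # file name and column names are assumptions

# Zero is not a physically valid reading for these measurements
invalid_zero_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Treat the zeros as missing, then fill them with each column's median
df[invalid_zero_cols] = df[invalid_zero_cols].replace(0, np.nan)
df[invalid_zero_cols] = df[invalid_zero_cols].fillna(df[invalid_zero_cols].median())
```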
3B. Remove outliers
It is important to handle outliers, as they can adversely affect the accuracy of many algorithms.
What are outliers? How to remove them?
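One common approach, sketched below, is the IQR rule: values lying more than 1.5 times the interquartile range outside the quartiles are treated as outliers (the Fare column and the titanic.csv path are assumptions for illustration):

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # file path is an assumption
col = "Fare"                     # example numeric column

# IQR fences: values far outside the middle 50% are treated as outliers
q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows whose value lies within the fences
df = df[(df[col] >= lower) & (df[col] <= upper)]
```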
3C. Log Transform Skewed Variables
We should check the distribution of every variable in the dataset, and if a variable is skewed, we can apply a log transformation to bring it closer to a normal distribution.
What is Skewness? How to visualize it with a Histogram and how to remove it?
How to visualize skewness of numeric variables by plotting histograms?
Log Transforming the Skewed Data to get Normal Distribution
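A short sketch of checking skewness, plotting histograms before and after, and log transforming the Titanic Fare column (the titanic.csv path is an assumption; log1p is used so that zero values do not break the transform):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("titanic.csv")  # file path is an assumption

print("Skewness before:", df["Fare"].skew())
df["Fare"].hist(bins=40)
plt.title("Fare before log transform")
plt.show()

# log1p = log(1 + x), so zero fares are handled safely
df["Fare_log"] = np.log1p(df["Fare"])
print("Skewness after:", df["Fare_log"].skew())
df["Fare_log"].hist(bins=40)
plt.title("Fare after log transform")
plt.show()
```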
4. Impute missing values
Step 2 is usually not recommended because you may lose a significant amount of data, so it is better to impute the missing values with meaningful values instead.
For numeric columns, you can impute the missing values with the mean, median or mode. Read more...
Implementation of Imputer in Python
For categorical columns, you can impute the missing values by introducing a new category or by using the most frequent category. Read more...
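A minimal sketch of both cases using Scikit-Learn's SimpleImputer (the current replacement for the older Imputer class), again assuming a local titanic.csv:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("titanic.csv")  # file path is an assumption

# Numeric column: fill missing ages with the median
df[["Age"]] = SimpleImputer(strategy="median").fit_transform(df[["Age"]])

# Categorical column: fill missing ports with the most frequent category
df[["Embarked"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["Embarked"]])

# Alternative for categoricals: introduce an explicit "Missing" category instead
# df["Embarked"] = df["Embarked"].fillna("Missing")
```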
5. Transform non-numeric variables to numeric variables
A dataset usually contains both numeric and non-numeric variables, and we need to handle them differently.
How to separate numeric and categorical variables?
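One quick way to split them with Pandas (a sketch, assuming the same Titanic DataFrame as above):

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # file path is an assumption

numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```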
5A. Transform categorical variables to dummy variables
To transform categorical variables into dummy variables, we can use the LabelEncoder and OneHotEncoder classes from Scikit-Learn or the get_dummies method from Pandas.
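For example, a short sketch with Pandas get_dummies (the column names assume the Titanic dataset):

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # file path is an assumption

# One-hot encode two categorical columns; drop_first avoids the dummy-variable trap
df = pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True)
print(df.filter(like="Sex").columns.tolist())
```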
5B. Transform date variables to numeric variables
By default, dates are treated as string values. We should convert them into numeric features.
How to convert dates into numbers in the dataset?
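A small sketch with a hypothetical purchase_date column (the column name and values are made up for illustration):

```python
import pandas as pd

# Hypothetical example: a date column stored as strings
df = pd.DataFrame({"purchase_date": ["2021-01-15", "2021-03-02", "2021-07-30"]})

# Parse the strings into a proper datetime type
df["purchase_date"] = pd.to_datetime(df["purchase_date"])

# Derive numeric features that a model can actually use
df["year"] = df["purchase_date"].dt.year
df["month"] = df["purchase_date"].dt.month
df["day_of_week"] = df["purchase_date"].dt.dayofweek
```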
6. Feature Engineering
Feature Engineering involves Binning, Scaling (Normalization and Standardization), Dimensionality Reduction, etc. We should scale the features in the dataset before running most algorithms on it. Standardization and Normalization are feature scaling techniques that bring all the values onto a similar scale and range. The features must be numeric before they can be scaled.
Binning Technique
Importance of Feature Scaling
Standardization vs Normalization
Implement Normalization in Python
Which algorithms require scaling and which not?
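A brief sketch of both scaling techniques with Scikit-Learn (the Titanic columns Age and Fare are used as examples; note that these scalers require the missing values to be filled first):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv("titanic.csv")  # file path is an assumption
cols = ["Age", "Fare"]
df[cols] = df[cols].fillna(df[cols].median())  # scalers cannot handle missing values

# Standardization: zero mean, unit standard deviation
standardized = StandardScaler().fit_transform(df[cols])

# Normalization (min-max scaling): values rescaled into the [0, 1] range
normalized = MinMaxScaler().fit_transform(df[cols])

print(standardized[:3])
print(normalized[:3])
```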
7. Dimensionality Reduction
Dimensionality reduction removes correlated and redundant variables and helps maximize the performance of the model. Basic techniques used for dimensionality reduction are:
- PCA (Principal Component Analysis)
- SVD (Singular Value Decomposition)
- LDA (Linear Discriminant Analysis)
- MDS (Multi-dimensional Scaling)
- t-SNE (t-Distributed Stochastic Neighbor Embedding)
- ICA (Independent Component Analysis)
Multicollinearity
Covariance vs Correlation
Visualize correlation score using Heatmap
Feature Selection and Feature Extraction
Need of Dimensionality Reduction
Factor Analysis
PCA, t-SNE, PCA vs t-SNE
Implement PCA in Python
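A minimal PCA sketch with Scikit-Learn (the feature columns are Titanic columns chosen purely for illustration; scaling first is a common practice because PCA is sensitive to feature scales):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("titanic.csv")  # file path is an assumption

# Take a few numeric features and scale them first
features = df[["Age", "Fare", "SibSp", "Parch"]]
features = features.fillna(features.median())
scaled = StandardScaler().fit_transform(features)

# Project onto the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```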
8. Splitting the dataset into training and testing data
We should not use the entire dataset to train a model. We should set aside around 20% of the data to test the accuracy of the model, so we usually maintain an 80:20 ratio between the training and testing datasets.
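A minimal sketch with Scikit-Learn's train_test_split (assuming the Titanic data with its Survived target column; random_state is set only to make the split reproducible):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic.csv")  # file path is an assumption

X = df.drop(columns=["Survived"])  # features
y = df["Survived"]                 # target column in the Titanic dataset

# 80:20 split between training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), "training rows,", len(X_test), "test rows")
```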