Data Wrangling is the first step we take while creating a Machine Learning model: the step in which we prepare data for a Machine Learning algorithm. This step is crucial and typically takes 60 to 80 percent of a project's time.


In Data Wrangling, we convert raw data into a suitable format that we can feed to a Machine Learning algorithm. The terms Data Wrangling and Data Preprocessing are used interchangeably. Data Wrangling is an art: you need a lot of patience while making your data fit for a Machine Learning algorithm.

Let's walk through the various steps involved in Data Wrangling.

**1. Drop unnecessary columns**

**1A. Drop the columns which contain IDs, Names, etc.** For example, in the Titanic dataset, we can easily drop the **Passenger Id**, **Passenger Name**, and **Ticket Number** columns, which are not useful for any kind of prediction.

**1B. Drop the columns which contain a lot of null or missing values.** Columns with around 75% or more missing values should be dropped from the dataset. For example, in the Titanic dataset, the Cabin column contains 687 null values out of 891 observations (77% missing values), so it makes sense to drop this column.

Visualize missing values using Bar Plot
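As a minimal sketch of step 1B using Pandas (the toy DataFrame below is illustrative, mimicking the Titanic Cabin column; the 75% threshold is the one suggested above):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the Titanic case: 'Cabin' is mostly missing
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
})

# Fraction of missing values per column
missing_ratio = df.isnull().mean()

# Drop any column where more than 75% of the values are missing
df = df.drop(columns=missing_ratio[missing_ratio > 0.75].index)
```

Here `Cabin` (80% missing) is dropped while `Survived` is kept.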

**1C. Drop the columns which have low variance.** You can drop a variable with zero or very low variance, because such variables barely affect the target variable. If all the values in a variable are approximately the same, you can safely drop it.

For example, if almost all the values in a numerical variable are 1, you can drop this variable.
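A minimal sketch of step 1C with Pandas, assuming a purely numeric DataFrame and an illustrative variance threshold:

```python
import pandas as pd

df = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1],   # zero variance -> uninformative
    "feature":  [3, 7, 2, 9, 4],
})

# Compute per-column variance and drop columns below a small threshold
variances = df.var()
low_var_cols = variances[variances < 1e-8].index
df = df.drop(columns=low_var_cols)
```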

**2. Remove rows containing null values.** If only around 10-15% of the observations contain null values, we can consider removing those observations.
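A sketch of step 2 using Pandas, with toy data where 10% of the rows contain a null (within the 10-15% guideline above):

```python
import pandas as pd
import numpy as np

ages = [22.0, 38.0, 26.0, np.nan, 35.0, 27.0, 54.0, 2.0, 31.0, 40.0]
df = pd.DataFrame({"Age": ages})

# Fraction of rows that contain at least one null (10% here)
null_fraction = df.isnull().any(axis=1).mean()

# Only drop rows when relatively few observations are affected
if null_fraction <= 0.15:
    df = df.dropna().reset_index(drop=True)
```

If the fraction were larger, imputation (step 4) would be preferable.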

**3. Remove noise.** Noise is data that is meaningless, distorted, or corrupted. It includes invalid values, outliers, and skewed values in the dataset. We need to remove this noise before supplying the dataset to an algorithm. Domain knowledge plays an important role in identifying and removing noisy data.

**3A. Replace invalid values.** Datasets often contain invalid values. For example, in the Pima Indian Diabetes dataset, there are zero values for Blood Pressure, Glucose, Insulin, etc., which are invalid. We need to replace these values with meaningful ones. Domain knowledge plays a crucial role in identifying invalid values.
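A sketch of step 3A with Pandas on a toy slice of the Pima-style data; replacing with the column median is one reasonable choice among several:

```python
import pandas as pd
import numpy as np

# Toy slice: zeros in these columns are physiologically impossible
df = pd.DataFrame({
    "Glucose":       [148, 85, 0, 89],
    "BloodPressure": [72, 66, 64, 0],
    "Outcome":       [1, 0, 1, 0],
})

invalid_zero_cols = ["Glucose", "BloodPressure"]

# Mark invalid zeros as missing, then fill with each column's median
df[invalid_zero_cols] = df[invalid_zero_cols].replace(0, np.nan)
df[invalid_zero_cols] = df[invalid_zero_cols].fillna(df[invalid_zero_cols].median())
```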

**3B. Remove outliers.** It is very important to remove outliers from the dataset, as they can adversely affect the accuracy of many algorithms.

What are outliers? How to remove them?
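One common way to remove outliers (a sketch, using Tukey's IQR fences rather than any method the article prescribes):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 300])  # 300 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only values inside the fences
cleaned = s[(s >= lower) & (s <= upper)]
```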

**3C. Log-transform skewed variables.** We should check the distribution of every variable in the dataset, and if one is skewed, we can apply a log transformation to make it approximately normally distributed.

What is Skewness? How to visualize it with a Histogram and how to remove it?

How to visualize skewness of numeric variables by plotting histograms?

Log Transforming the Skewed Data to get Normal Distribution
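A sketch of step 3C with NumPy and Pandas; the fare-like values are illustrative, and `log1p` (log of 1 + x) is used so zero values stay safe:

```python
import numpy as np
import pandas as pd

# Right-skewed data (e.g. fares, incomes)
fares = pd.Series([7.25, 8.05, 13.0, 26.55, 71.28, 512.33])

skew_before = fares.skew()

# log1p compresses the long right tail
log_fares = np.log1p(fares)
skew_after = log_fares.skew()
```

The skewness statistic drops after the transform, indicating a more symmetric distribution.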

**4. Impute missing values.** Step 2 is usually not recommended, as you may lose significant data. It is better to impute the missing values with meaningful ones.

For **numeric columns**, you can impute the missing values with the mean, median, or mode.

Implementation of Imputer in Python
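A minimal sketch using Scikit-Learn's `SimpleImputer` (the modern replacement for the older `Imputer` class); the array is illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace NaNs with the column mean ("median" and "most_frequent"
# are also available; "most_frequent" works for categorical data too)
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
```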

For **categorical columns**, you can impute the missing values by introducing a new category, or with the most frequently occurring category.

**5. Transform non-numeric variables to numeric variables.** The dataset contains both numeric and non-numeric variables, and we need to handle them differently.

How to separate numeric and categorical variables?
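A one-liner sketch for separating the two kinds of columns with Pandas' `select_dtypes`; the toy Titanic-style frame is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Age":  [22, 38, 26],
    "Fare": [7.25, 71.28, 7.92],
    "Sex":  ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Split by dtype: numeric columns vs everything else
numeric_df = df.select_dtypes(include="number")
categorical_df = df.select_dtypes(exclude="number")
```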

**5A. Transform categorical variables to dummy variables.** To transform categorical variables into dummy variables, we can use the LabelEncoder and OneHotEncoder classes from Scikit-Learn, or the get_dummies method from Pandas.
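A sketch of step 5A using Pandas' `get_dummies`; the columns are illustrative, and `drop_first=True` is one common choice to avoid redundant (perfectly collinear) dummy columns:

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female"],
                   "Embarked": ["S", "C", "S"]})

# One-hot encode; drop_first removes one level per variable
dummies = pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True)
```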

**5B. Transform date variables to numeric variables.** By default, dates are treated as string values. We should convert them to numeric values.

How to convert dates into numbers in the dataset?
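A sketch of step 5B with Pandas: parse the strings into datetimes, then derive numeric features (the date column and chosen features are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2020-01-15", "2020-06-30", "2021-12-01"]})

# Parse strings to datetimes, then extract numeric components
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
```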

**6. Feature Engineering.** Feature Engineering involves **Binning**, **Scaling** (**Normalization and Standardization**), **Dimensionality Reduction**, etc. We need to standardize or normalize the features in the dataset before running many algorithms on it. Standardization and Normalization are feature scaling techniques that bring all the values onto the same scale and range. Features should be numeric in nature.

Binning Technique

Importance of Feature Scaling

Standardization vs Normalization

Implement Normalization in Python

Which algorithms require scaling and which not?
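A sketch of the two scaling techniques with Scikit-Learn; the array is illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: zero mean, unit variance per column
X_std = StandardScaler().fit_transform(X)

# Normalization: rescale each column to the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X)
```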

**7. Dimensionality Reduction.** Dimensionality reduction is used to remove correlated variables and maximize the performance of the model. Basic techniques used for dimensionality reduction are:

- **PCA** (Principal Component Analysis)
- **SVD** (Singular Value Decomposition)
- **LDA** (Linear Discriminant Analysis)
- **MDS** (Multi-dimensional Scaling)
- **t-SNE** (t-Distributed Stochastic Neighbor Embedding)
- **ICA** (Independent Component Analysis)

Multicollinearity

Covariance vs Correlation

Visualize correlation score using Heatmap

Feature Selection and Feature Extraction

Need for Dimensionality Reduction

Factor Analysis

PCA, t-SNE, PCA vs t-SNE

Implement PCA in Python
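A sketch of PCA with Scikit-Learn on synthetic data (the data generation is illustrative: two informative dimensions plus two near-duplicate, correlated copies, so two components recover almost all the variance; scaling first matters because PCA is sensitive to feature scale):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# Append two noisy copies -> 4 correlated features, effective rank ~2
X = np.hstack([base, base + rng.normal(scale=0.01, size=(100, 2))])

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
```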

**8. Splitting the dataset into training and testing data.** We should not use the entire dataset to train a model. We should keep aside around 20% of the data to test the accuracy of the model, so we usually maintain an 80:20 ratio between training and testing datasets.
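The 80:20 split above can be sketched with Scikit-Learn's `train_test_split` (the arrays and random seed are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Hold out 20% of the rows for testing (the 80:20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```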
