Sunday, 17 February 2019

Why do we need to standardize and transform the features in a dataset before applying Machine Learning algorithms?

Before applying Machine Learning algorithms to a dataset, we often need to standardize and transform its features. Why is this required? Let's try to understand with an example:

Consider an Employee dataset. It contains features like Employee AGE and Employee SALARY. The AGE feature contains values on the scale 25-60, while SALARY contains values on the scale 10000-100000. Because these two features differ so much in scale, the difference can adversely impact the performance of many algorithms, so they need to be brought to a common scale before building Machine Learning models. In other words, we need to standardize these features.

The idea behind Standardization is that it transforms your data so that its distribution has a mean of 0 and a standard deviation of 1.

MEAN = 0
STANDARD DEVIATION = 1

Given the distribution of the data, each value in the dataset has the sample mean subtracted from it, and the result is then divided by the standard deviation of the whole dataset.
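In formula terms (a standard z-score; the mean and standard deviation are computed from the data):

z = (x - mean) / standard deviation

For example, if AGE had a mean of 40 and a standard deviation of 10, an age of 50 would become (50 - 40) / 10 = 1.0 (purely illustrative numbers).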

How to implement Standardization using the scikit-learn library in Python?

StandardScaler performs the task of Standardization.

First of all, you need to import StandardScaler:

from sklearn.preprocessing import StandardScaler 

Assuming you have split your dataset like this:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20) 

Transform your X_train and X_test features like this:

scaler = StandardScaler()
scaler.fit(X_train)                   # learn the mean and standard deviation from the training data only
X_train = scaler.transform(X_train)   # scale the training data
X_test = scaler.transform(X_test)     # scale the test data using the training statistics (avoids data leakage)
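As a quick sanity check (a minimal sketch; it assumes the transformed X_train is a numeric NumPy array, which is what StandardScaler returns), each standardized column should now have a mean close to 0 and a standard deviation close to 1:

import numpy as np

print(np.round(X_train.mean(axis=0), 2))   # approximately 0 for every column
print(np.round(X_train.std(axis=0), 2))    # approximately 1 for every column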

Saturday, 16 February 2019

Basic steps to implement a Machine Learning Algorithm in Python

I will illustrate some basic and common steps which you have to take while implementing any Machine Learning algorithm in Python. In this post, I will take a simple example of the XGBoost algorithm.

To implement the XGBoost algorithm in Python, you first import the required libraries, load the dataset, define the X and Y dimensions, split the dataset into training and test sets, fit the model and predict with it, and finally check the accuracy.

Assumptions:

1. You have basic knowledge of Python, Jupyter Notebook and Machine Learning Libraries in Python.

2. The dataset is in a proper format, so I don't have to do any data wrangling or implement any dimensionality reduction technique.

3. You know the basics of the XGBoost algorithm.

Steps:

1. Import libraries like pandas, numpy, sklearn etc.

2. Load dataset

3. Define the X (features) and Y (target) dimensions

4. Split the data into training and test datasets

5. Fit the algorithm to the training dataset

6. Predict on the test dataset

7. Check accuracy

1. Import libraries

from numpy import loadtxt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

2. Load dataset (I will load PIMA Indians Diabetes dataset)

dataset = loadtxt('C:\\datasets\\pima-indians-diabetes.csv', delimiter=",")

3. Define the X (features) and Y (target) dimensions

X = dataset[:,0:8]   # the first 8 columns are the input features
Y = dataset[:,8]     # the last column is the class label

4. Split the dataset into training and test datasets

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=0)

5. Fit the XGBoost Classification Algorithm to the training dataset

model = XGBClassifier()        # default hyperparameters
model.fit(X_train, y_train)    # train the model on the training set

6. Predict on the test dataset with the XGBoost Classification Algorithm

y_pred = model.predict(X_test)   # predicted class labels for the test set

7. Check accuracy

accuracy = accuracy_score(y_test, y_pred) * 100   # percentage of correct predictions
print(accuracy)

Friday, 15 February 2019

How to find out versions of Machine Learning Libraries in Python?

Let's try to find out the versions of some commonly used Machine Learning libraries in Python. The following Python code, written in a Jupyter notebook, finds out the versions of the Python, Pandas, Numpy, Scikit-Learn and XGBoost libraries.
 
import sys
import pandas
import numpy
import sklearn
import xgboost

print ("Python Version: " + sys.version)
print ("Pandas Version: " + pandas.__version__)
print ("Numpy Version: " + numpy.version.version)
print ("Scikit-Learn Version: " + sklearn.__version__)
print ("XGBoost Version: " + xgboost.__version__)

Press Ctrl + Enter. 
You should get the following result (the exact versions will depend on your installation):

Python Version: 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
Pandas Version: 0.23.4
Numpy Version: 1.15.4
Scikit-Learn Version: 0.20.1
XGBoost Version: 0.81

Note: If you get a ModuleNotFoundError for the XGBoost library, try running the following code to install XGBoost from within the notebook. 

import sys
!{sys.executable} -m pip install xgboost

What is the cause of Overfitting and how can we avoid it in Machine Learning?

How do we avoid overfitting in Machine Learning? What are the various ways to deal with overfitting of the data in Machine Learning? These are very important questions in the world of Data Science and Machine Learning. 

Let's start this discussion with a small story of two brothers: Ram and Sham.

Ram and Sham study in the first standard. Tomorrow is their Maths exam. The syllabus contains 20 questions (5 questions each from Addition, Subtraction, Multiplication and Division).

The 5 Addition questions in the syllabus are:

2+5=7
3+2=5
1+8=9
4+2=6
7+1=8

Ram learnt only Addition and Subtraction.
Sham memorized all 20 questions.

Exam Result: Ram just managed to pass while Sham became one of the toppers.

The next day, Sham was given the following Addition problems to solve:

3+2 (Sham answered 5 because this was in the syllabus and he had memorized it)
1+1 (Sham failed to answer this simple question because this was not in the syllabus)
2+1 (Sham failed to answer this simple question because this was not in the syllabus)

Ram was able to solve all of the above, but was not able to solve the Multiplication and Division problems.

Story finished. Now let's relate it to the Underfitting and Overfitting concepts in Machine Learning.

Consider the Maths syllabus as the training dataset.

Ram learnt only a small portion of the syllabus. This is Underfitting. The algorithm is not aware of all the scenarios, so it will not be able to predict the scenarios in the test dataset that it was never trained on. But it will be able to generalize on the scenarios which it did learn.

Sham memorized all 20 questions. This is Overfitting. The algorithm will not be able to generalize to the data in the test dataset, which results in high variance.

So, the following points should be noted about Overfitting and Underfitting:

1. A good algorithm should have low/reasonable bias on the training dataset. Then it will also tend to have low variance on the test dataset, which is the sign of a consistent algorithm.

2. If an algorithm overfits the training dataset (has close to zero bias), there is a high possibility that it will have high variance on the test dataset, which is bad.

3. When a model overfits, it loses its generalization capacity, due to which it shows poor performance on the test dataset.

4. A model which overfits the training set has usually become too complex.

5. A model which underfits the training set has usually remained too simple.

What causes Overfitting? 

1. A small training dataset

2. A model that is too complex for the amount of available data (see point 4 above)

How to avoid Overfitting? 

We need to find a balance between overfitting and underfitting. This is achieved by the following techniques:

1. Cross Validation Techniques

2. Regularization Techniques (Ridge Regression - L2, Lasso Regression - L1, Elastic Net Regression)

3. Ensemble Learning Techniques (Bagging and Boosting)

  • Bagging: Random Forest
  • Boosting: AdaBoost, Gradient Boosting Machine (GBM), XGBoost 

I will discuss all these techniques in detail in my next post.
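Until then, here is a minimal sketch of k-fold cross-validation with scikit-learn. It assumes X and y are already defined, and the classifier and cv=5 are only illustrative choices:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, validate on the remaining fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())

If the mean score is decent and the standard deviation across the folds is small, the model is behaving consistently rather than memorizing one particular split.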

Monday, 11 February 2019

Why is Dimensionality Reduction required in Machine Learning?

Dimensionality Reduction is a very important step in Machine Learning. Below are the advantages of Dimensionality Reduction in Machine Learning:

1. Reduction in Computation Time: Fewer dimensions lead to less computation/training time, which improves the performance of the algorithm.

2. Improves Algorithm Performance: Some algorithms do not perform well when the dataset has a large number of dimensions. So, by reducing these dimensions, we can increase the performance of the algorithm.

3. Removes Multicollinearity and Correlated Variables: Multicollinearity occurs when independent variables in a model are correlated. This correlation is a problem because independent variables should be independent. Dimensionality Reduction takes care of multicollinearity by removing redundant features. 

For example, suppose you have two variables: ‘time spent on treadmill in minutes’ and ‘calories burnt’. These variables are highly correlated, as the more time you spend running on a treadmill, the more calories you burn. Hence, there is little point in storing both, as just one of them carries the information you require.

4. Better Data Visualization: It helps in visualizing the data in a better way. It is very difficult to visualize data in higher dimensions so reducing our space to 2D or 3D may allow us to plot and observe patterns more clearly.

5. Less Storage Required: Space required to store the data is reduced as the number of dimensions comes down.

7 Basic Types of Machine Learning Algorithms You Must Know

I have listed below 7 types of Machine Learning algorithms which you must know. You should have thorough knowledge of these algorithms and techniques: why and where they are used, what mathematics lies behind them, how they are implemented in Python and R, how to measure their performance, and so on. 

Below is the list of basic types of Machine Learning algorithms:

1. Classification Algorithms
  • KNN (K-Nearest Neighbors)
  • Naive Bayes
  • Decision Trees and Random Forest
  • SVM (Support Vector Machine)
2. Regression Algorithms
  • Linear Regression
  • Logistic Regression (despite its name, mainly used for classification)
3. Clustering and Association Algorithms
  • K-Means Clustering
4. Dimensionality Reduction Techniques 
  • Feature Selection and Feature Extraction
  • PCA (Principal Component Analysis)
  • SVD (Singular Value Decomposition)
  • LDA (Linear Discriminant Analysis)
  • MDS (Multidimensional Scaling)
  • t-SNE (t-Distributed Stochastic Neighbor Embedding)
  • ICA (Independent Component Analysis)
5. Regularization 
  • Ridge Regression (L2 Regularization)
  • Lasso Regression (L1 Regularization)
  • Elastic-Net Regression
6. Ensemble Learning Techniques and Algorithms 
  • Bagging and Boosting
  • Random Forest
  • AdaBoost
  • Gradient Boosting Machine (GBM)
  • XGBoost
7. Time Series Analysis and Sentiment Analysis

I will keep adding more algorithms and techniques to the list in future.

Dimensionality Reduction: Feature Selection and Feature Extraction Techniques in Machine Learning

Whenever you get a dataset, you don't directly jump to building a model from it. Instead, your first and most important task is to analyze and clean the data. This task consumes most of the time in Machine Learning. Dimensionality Reduction is one of the most important tasks at this phase. 

We will discuss various Dimensionality Reduction techniques in this article. I will not go into the details of each technique because that would drastically increase the length of this blog post. So, I will keep it short and simple.

Dimensionality Reduction is used to reduce the number of features or variables in the dataset without losing much information, and to improve the performance of the model.

Dimensionality Reduction can be done in two ways:

1. Feature Selection: Remove unwanted variables

2. Feature Extraction: Extract important variables. Find a smaller set of new variables, each a combination of the original variables, containing essentially the same information as the original variables.

Feature Selection Techniques:

1. Handle variables with missing values
2. Check for variance in a variable 
3. Check for correlation between two variables
4. Random Forest
5. Backward Feature Elimination
6. Forward Feature Selection

Feature Extraction Techniques:

1. Factor Analysis
2. PCA (Principal Component Analysis)
3. SVD (Singular Value Decomposition)
4. LDA (Linear Discriminant Analysis)
5. MDS (Multidimensional Scaling)
6. t-SNE (t-Distributed Stochastic Neighbor Embedding)
7. ICA (Independent Component Analysis)

Let's elaborate on the above Dimensionality Reduction techniques:

Feature Selection Techniques:

1. Handle variables with missing values

1. If the count of missing values in a variable or feature is greater than a threshold value, then remove the variable.

2. If there are not too many missing values in a variable or feature, then you can do the following (a small code sketch follows this list):

  • If it is a numerical variable, then you can replace the missing value with the mean, median or mode of the variable.
  • If it is a categorical variable, then you can replace the missing value by introducing a new category or class.
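A minimal pandas sketch of these two rules. The DataFrame df, the column names 'age' and 'city', and the 50% threshold are purely illustrative assumptions:

import pandas as pd

# drop columns where more than 50% of the values are missing
df = df.loc[:, df.isnull().mean() <= 0.5]

# numerical variable: fill missing values with the median (mean or mode work similarly)
df['age'] = df['age'].fillna(df['age'].median())

# categorical variable: fill missing values with a new category
df['city'] = df['city'].fillna('Unknown')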

2. Check for variance in a variable 

You can drop a variable with zero or very low variance, because a variable that hardly varies carries very little information about the target variable. If all the values in a variable are approximately the same, then you can safely drop this variable. 

For example, if almost all the values in a numerical variable are 1, then you can drop this variable.
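scikit-learn's VarianceThreshold can automate this check. A minimal sketch, assuming X is already defined and the threshold value of 0.01 is only an illustrative choice:

from sklearn.feature_selection import VarianceThreshold

# drop every feature whose variance is below the threshold
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print(selector.get_support())   # boolean mask of the features that were kept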

3. Check for correlation between two variables

High correlation between two variables means they have similar trends and are likely to carry similar information. This can bring down the performance of some models drastically (linear and logistic regression models, for instance). 

For example, we can calculate the correlation between independent numerical variables. If the correlation coefficient crosses a certain threshold value, we can drop one of the variables (dropping a variable is highly subjective and should always be done keeping the domain in mind).

As a general guideline, we should keep those variables which show a decent or high correlation with the target variable.
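A minimal pandas sketch of this idea, assuming df holds only the independent numerical variables and 0.9 is the chosen threshold:

import numpy as np

# absolute correlation matrix of the independent variables
corr = df.corr().abs()

# keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# drop one variable from every pair whose correlation crosses the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)

Remember that which variable of the pair you drop is a subjective, domain-driven decision.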

4. Random Forest

Random Forest is one of the most widely used algorithms for feature selection. It helps us select a smaller subset of features.

This topic requires a broader discussion, so I will make a separate post for it.
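Until then, here is a minimal sketch of how feature importances are usually read off a fitted Random Forest. It assumes X and y are already defined; the hyperparameters are only illustrative:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# a higher importance means the feature contributed more to the splits in the trees
for i, importance in enumerate(model.feature_importances_):
    print("feature", i, round(importance, 3))

You can then keep only the features whose importance is above some chosen cut-off.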

5. Backward Feature Elimination

1. Create a model with all variables (say n variables) and test its performance.

2. Remove one variable at a time, prepare the model with the remaining n-1 variables, and test its performance. If there is no impact, or only a small impact, on the performance of the model, you can consider removing that variable.

3. Keep repeating this process for all the variables and decide whether to retain or drop each one (a small code sketch follows this list).
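scikit-learn's RFE (Recursive Feature Elimination) automates a closely related idea: it repeatedly fits the model and removes the weakest feature based on the model's coefficients or importances, rather than re-testing performance after each removal. A minimal sketch, assuming X and y are defined and that keeping 5 features is only an illustrative choice:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)
print(selector.support_)   # True for the features that were kept
print(selector.ranking_)   # rank 1 = kept; higher ranks were eliminated earlier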

6. Forward Feature Selection (the opposite of Backward Feature Elimination)

1. Prepare a model with one variable and test its performance.

2. Add another variable and test the performance again. If there is a significant gain in the performance of the model, then you can consider retaining this variable; otherwise you can drop it.

3. Keep repeating this process for all the variables and decide whether to retain or drop each one (a small code sketch follows this list).
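Newer versions of scikit-learn (0.24 and later) provide SequentialFeatureSelector, which performs this greedy forward search automatically. A minimal sketch, assuming X and y are defined; the estimator and the number of features to keep are only illustrative choices:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# at every step, add the feature that improves cross-validated performance the most
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=5,
                                     direction='forward')
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected features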

Feature Extraction Techniques:

1. Factor Analysis

2. PCA (Principal Component Analysis)

3. SVD (Singular Value Decomposition)

4. LDA (Linear Discriminant Analysis)

5. MDS (Multidimensional Scaling)

6. t-SNE (t-Distributed Stochastic Neighbor Embedding)

7. ICA (Independent Component Analysis)

The above list requires detailed elaboration. So, I will discuss all of them in my future posts.
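As a small preview of the feature extraction side, here is a minimal PCA sketch with scikit-learn. It assumes X is already defined, and keeping 2 components is only an illustrative choice; PCA generally expects standardized features:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# standardize the features first so no single scale dominates
X_scaled = StandardScaler().fit_transform(X)

# keep the 2 new components that capture the most variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)   # share of variance explained by each component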

Friday, 8 February 2019

Difference between Covariance and Correlation in Machine Learning

Covariance and Correlation are two important concepts of Mathematics (especially of Probability and Statistics) which are heavily used in Machine Learning mainly for Data Analysis and Data Wrangling. 

Dimensionality Reduction in Machine Learning mainly depends upon Covariance and Correlation among different variables or features in the dataset. For example, PCA (Principal Component Analysis) algorithm uses Correlation concept for Feature Extraction. 

Covariance and Correlation describe the relationship and inter-dependence between two variables. Both depict how a change in one variable relates to a change in another variable. The relationship between two variables or features can be positive, negative, or there can be no relationship at all.

Difference between Covariance and Correlation

1. Correlation between two variables is a normalized version of their Covariance

To calculate the Correlation between random variables X and Y, we need to divide the Covariance of X and Y by the product of the Standard Deviation of X and the Standard Deviation of Y.


Correlation(X, Y) = Covariance(X, Y) / (Standard Deviation of X * Standard Deviation of Y)

As per the above equation, a positive Covariance always results in a positive Correlation and a negative Covariance always results in a negative Correlation.

2. Covariance varies from negative infinity to positive infinity while Correlation varies from -1 to 1. If the Correlation between two variables is, say, 0.85, you can say that a change in one variable results in a similar change in the other variable, so the two variables are said to be correlated with each other. 

3. Covariance is Unit Dependent while Correlation is Unit Independent (it means Correlation is dimensionless).

4. Covariance is Scale Dependent while Correlation is Scale Independent. This means that a difference in scale can result in a different Covariance. For example, Height vs Weight (in Kg) and Height vs Weight (in Pounds) will have different Covariance values but the same Correlation value.
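A minimal numpy sketch of points 3 and 4; the height and weight numbers are made up purely for illustration:

import numpy as np

height = np.array([150, 160, 170, 180, 190])   # in cm
weight_kg = np.array([50, 58, 66, 74, 82])     # in kg
weight_lb = weight_kg * 2.20462                # the same data in pounds

# covariance changes when the unit of measurement changes...
print(np.cov(height, weight_kg)[0, 1])
print(np.cov(height, weight_lb)[0, 1])

# ...but correlation stays the same (and always lies between -1 and 1)
print(np.corrcoef(height, weight_kg)[0, 1])
print(np.corrcoef(height, weight_lb)[0, 1])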