Wednesday, 20 March 2019

What is Multicollinearity? What is Structural and Data Multicollinearity?

Multicollinearity is a situation in which two or more predictors (independent variables) in a model are highly correlated. 

For example, suppose you have two explanatory variables – ‘time spent on treadmill in minutes’ and ‘calories burnt’. These variables are highly correlated, as the more time you spend running on a treadmill, the more calories you burn. Hence, there is little point in keeping both, as just one of them carries the information you require. Generally, if the correlation between two independent variables is high (>= 0.8), we drop one of them, otherwise it may lead to a multicollinearity problem. 

If the degree of multicollinearity between the independent variables is high enough, it can cause problems when you fit the model and interpret the results.

Collinearity is a linear association between two explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between them.

Types of Multicollinearity

1. Structural Multicollinearity: This type of multicollinearity occurs when we create one variable from another variable while building the model. For example, suppose there is a variable "x" and you create another variable "y" from it, where y = cx (c is any constant). In this case, "x" and "y" are perfectly correlated.

2. Data Multicollinearity: This type of multicollinearity is present in the data itself, so we need to identify it during the data wrangling process.

How to remove correlated variables?

The following techniques are used to handle the multicollinearity problem in a dataset:

1. PCA (Principal Component Analysis)
2. SVD (Singular Value Decomposition)
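
As a small sketch of how you might detect and handle correlated predictors in Python (assuming df is a pandas DataFrame containing only the numeric predictors; the 0.95 variance threshold is just an illustrative choice):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Inspect pairwise correlations; values >= 0.8 indicate potential multicollinearity
print(df.corr())

# Standardize the predictors, then replace them with uncorrelated principal components
scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=0.95)   # keep enough components to explain ~95% of the variance
components = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)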

Related: Covariance vs Correlation

Tuesday, 19 March 2019

Data Wrangling Techniques: Steps involved in Data Wrangling Process

Data Wrangling is the first step we take while creating a Machine Learning model. This is the main step in which we prepare the data for a Machine Learning algorithm. This step is crucial and typically takes 60 to 80 percent of the total time. 

In Data Wrangling, we convert the raw data into a suitable format which we can feed into any Machine Learning algorithm. The terms Data Wrangling and Data Preprocessing are used interchangeably. Lets see the various steps one should take while wrangling data.

Steps involved in Data Wrangling

Below are the basic steps involved in the Data Wrangling process:

1. Load, Explore and Analyze your data

Load the dataset into a pandas dataframe and run some exploratory data methods and visualization techniques.

2. Drop the columns that are not required, like columns containing IDs, Names etc. 

For example, in the Titanic dataset, we can easily remove Passenger Id, Passenger Name and Ticket Number, which are not required for any kind of prediction. Read more...

3. Drop the columns which contain a lot of null or missing values

Columns which contain a very large proportion of missing values (say around 75% or more) should be removed from the dataset. For example, in the Titanic dataset, the Cabin column contains 687 null values out of 891 observations (77% missing values), so it makes sense to remove this column. Read more... 

4. Remove the rows which contain null values

If only around 10-15% of the observations contain null values, we can consider removing those observations. Read more...

5. Impute missing values

Step 4 is usually not recommended, as you may lose significant data. So, it is better to impute the missing values with some meaningful values.

For numeric columns, you can impute the missing values with mean, median or mode. Read more...

For categorical columns, you can impute the missing values by introducing a new category. Read more...

6. Replace invalid values

Many times there are invalid values present in the dataset. For example, in the Pima Indians Diabetes dataset, there are zero values for Blood Pressure, Glucose, Insulin etc., which are invalid. We need to replace these values with some meaningful values. Domain knowledge plays a crucial role in identifying invalid values. Read more...

7. Transform categorical variables to dummy variables

To transform categorical variables into dummy variables, we can use the LabelEncoder and OneHotEncoder classes from Scikit Learn or the get_dummies method from the Pandas library in Python.

8. Feature Scaling and Normalization

We need to standardize or normalize the features in the dataset. The idea behind standardization is to transform your data such that each feature's distribution has a mean of 0 and a standard deviation of 1. Read more...
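
A small sketch of this step using Scikit Learn (assuming X_train and X_test are numeric feature matrices; note that the scaler should be fit on the training data only and then reused on the test data):

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization: each feature gets mean 0 and standard deviation 1
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)   # reuse the statistics learned from the training data

# Normalization: each feature is rescaled to the [0, 1] range
minmax = MinMaxScaler()
X_train_norm = minmax.fit_transform(X_train)
X_test_norm = minmax.transform(X_test)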

9. Dimensionality Reduction

Dimensionality reduction is often required to improve the performance of the model. Please go through my previous posts on dimensionality reduction to understand the need for this step:

Covariance vs Correlation
Feature Selection and Feature Extraction
Need of Dimensionality Reduction
PCA, t-SNE, PCA vs t-SNE 

10. Splitting the dataset into training and testing data

We should not use the entire dataset to train a model. We should keep aside around 20-25% of the data to test the accuracy of the model, so we usually maintain a ratio of around 75:25 or 80:20 between the training and testing datasets.

Monday, 18 March 2019

Data Wrangling: How to remove invalid values from dataset in Python

Domain knowledge plays a crucial role in data wrangling. Sometimes there are no missing values in the dataset, but there are a lot of invalid values which we need to identify and handle manually.

For example, consider "Pima Indians Diabetes" dataset which predicts the onset of diabetes within 5 years in Pima Indians given medical details. This dataset has a lot of invalid values which we will try to remove in the article.

Lets load this dataset. You can download it from here.

import pandas as pd
import numpy as np

names = ['PregCount', 'Glucose', 'BP', 'SkinFold', 'Insulin', 'BMI', 'Pedigree', 'Age', 'Class']
dataset = pd.read_csv("C:\\datasets\\pima-indians-diabetes.csv", names=names) 

dataset.shape

This dataset has 768 observations, 8 input parameters and 1 class label. The input parameters are:

1. Number of times pregnant.
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
3. Diastolic blood pressure (mm Hg).
4. Triceps skinfold thickness (mm).
5. 2-Hour serum insulin (mu U/ml).
6. Body mass index (weight in kg/(height in m)^2).
7. Diabetes pedigree function.
8. Age (years).

Lets see the count of null or missing values in this dataset:

dataset.isnull().sum()

We find that there are no missing values in it. Now lets look at the statistics of the data:

dataset.describe()

We find that the following columns have a minimum value of zero:

1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
2. Diastolic blood pressure (mm Hg).
3. Triceps skinfold thickness (mm).
4. 2-Hour serum insulin (mu U/ml).
5. Body mass index (weight in kg/(height in m)^2).

Here domain knowledge plays a vital role. Although there are no null or missing values in this dataset, there are a lot of invalid (zero) values in the above columns. Lets see how many zero values there are in each of these columns:

print((dataset[["Glucose", "BP", "SkinFold", "Insulin", "BMI"]] == 0).sum())  

Glucose - 5
BP - 35
SkinFold - 227
Insulin - 374
BMI - 11

In order to handle these invalid zero values, we will mark these values as NaN. 

dataset[["Glucose", "BP", "SkinFold", "Insulin", "BMI"]] = dataset[["Glucose", "BP", "SkinFold", "Insulin", "BMI"]].replace(0, np.NaN)

Now print null values:

dataset.isnull().sum()

PregCount - 0
Glucose - 5
BP - 35
SkinFold - 227
Insulin - 374
BMI - 11
Pedigree - 0
Age - 0
Class - 0

NaN values are excluded from operations like sum, count etc. As these are numeric columns, we can take the mean, median or mode of the non-missing values and use it to replace the NaN values. To know more about it, please go through this post.
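
For example, a quick way to impute the newly marked NaN values with each column's mean (median or mode would work the same way; this is just one option):

# Fill the NaN values in each column with that column's mean
cols = ["Glucose", "BP", "SkinFold", "Insulin", "BMI"]
dataset[cols] = dataset[cols].fillna(dataset[cols].mean())

# Verify that no missing values remain
print(dataset.isnull().sum())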

Sunday, 17 March 2019

Data Wrangling: How to handle missing values in categorical column in a dataset?

In my previous article, we saw how to impute missing values in numeric columns. Today, we will see how we can impute missing values in categorical columns. 

Again, we will take the example of the titanic dataset. There are two categorical columns (Cabin and Embarked) in the titanic dataset which have missing values. Cabin has 687 missing values (out of 891), so it's better to drop this column as more than 77% of its data is null. So, lets concentrate on the Embarked column, which has only 2 missing values.

In categorical columns, we usually introduce a new category (often called "Unknown") to impute missing values. This column already has the categories 'S', 'C' and 'Q', so lets impute 'U' (Unknown) as a new category for the 2 missing values.

dataset["Embarked"].fillna("U")

If you think you should not drop the "Cabin" column, you can try imputing the missing cabin category in the same way we did for the "Embarked" column. It's up to you. Please note that introducing a new category ("Unknown") which is not a part of the original dataset may add some variance to the prediction. 

As it is a categorical variable, the next step is to one hot encode this column. I have already explained this step here. You can also use the get_dummies method of Pandas to one hot encode this categorical variable. So, before one hot encoding any categorical column, you should check it for missing values and, if any are found, impute them (for example with an unknown category).

Data Wrangling: How to handle missing values in numeric column in a dataset?

Handling missing values in a dataset is a very common issue and there are various ways to deal with it. Today, we will see how to handle missing values in a numeric column. For this, we will consider the "Age" column (which contains numeric values) from the titanic dataset. 

Note: Please go through my previous post where I have loaded this titanic dataset.

Lets see how many missing values there are in the "Age" column of the titanic dataset:

dataset["Age"].isnull().sum()

We find that there are 177 missing values out of 891 observations. Now, how do we handle these 177 missing values? 

The general method to handle such scenarios is to replace the missing values with some meaningful value. This meaningful value can be obtained by taking the mean, median or mode of all the non-null values in the "Age" column. This is a statistical approach to handling missing values and is well suited for linear data. 

Step 1: Calculate mean, median and mode 

mean_age = dataset["Age"].mean()
median_age = dataset["Age"].median()
mode_age = dataset["Age"].mode()
display("Mean Age: " + str(mean_age))
display("Median Age: " + str(median_age))
display("Mode Age: " + str(mode_age))

Output

'Mean Age: 29.69911764705882'
'Median Age: 28.0'
'Mode Age: 0    24.0\ndtype: float64'

Step 2: Replace the missing values 

Replace the missing values in the "Age" column with any of the above calculated values. In this case, I am going to replace the missing values with the mean value.

dataset["Age"].replace(np.NaN, mean_age)

Please note that this is just an approximation of the missing values and it may add some variance to the prediction, but we have to live with that. This approach is still far better than dropping the "Age" column, which would cost us a lot of significant data.

Saturday, 16 March 2019

Difference between Label Encoder and One Hot Encoder in Python (Scikit Learn Library)

The Scikit Learn library contains LabelEncoder and OneHotEncoder. These two encoders are used to convert categorical data into numbers. We will implement them and also see the difference between them. There is a similar method in the Pandas library called get_dummies which does the same thing. You can see more details on get_dummies in this post.

Consider the titanic dataset, in which we have a "Sex" column containing "male" and "female" values. As these are string values, we need to convert them to numbers before using this data in any machine learning algorithm. 

In my previous article, I used get_dummies to generate new columns for "male" and "female" which contain zeros and ones. Now we will use LabelEncoder and OneHotEncoder for the same purpose. While implementing both encoders, the difference between them will also become clear.

Step 1: Use LabelEncoder to convert "male" and "female" to zeros and ones

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
x[:, 0] = labelencoder.fit_transform(x[:, 0])

In the above code, x is assumed to be the NumPy array of feature values and "Sex" is assumed to be the first column in the dataset. You can change the index as per your dataset. 

After running the above code, I will have zeros and ones under the "Sex" column. LabelEncoder does this part. It does not create new columns corresponding to each categorical value. For that, we need to further apply OneHotEncoder to the label encoded values. 

Why do we need to one hot encode the label encoded values?

Consider the situation where a column has more than two categorical values. For example, in our titanic dataset, there is a column called Embarked which has 3 categorical values ('S', 'C', 'Q'). LabelEncoder will convert these values into 0, 1 and 2. Although there is no ordering between 'S', 'C' and 'Q' in the original dataset, after label encoding it appears that there is some kind of order like 'Q' > 'C' > 'S' (which is not true), as 'Q' is encoded to 2, 'C' to 1 and 'S' to 0. To remove this confusion, we further apply one hot encoding to create separate columns for 'S', 'C' and 'Q' which contain only zeros and ones.

Step 2: Convert the Label Encoded values to One Hot Encoded values

OneHotEncoder takes a column which has been label encoded and splits it into multiple columns. The integer values are replaced by ones and zeros, indicating which category each row belongs to.

from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features = [0])
x = onehotencoder.fit_transform(x).toarray()
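
Please note that the categorical_features parameter of OneHotEncoder has been deprecated and removed in newer versions of Scikit Learn. If the above line does not work for your version, a rough equivalent (still assuming the categorical column is at index 0 of the array x) is to use ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One hot encode only column 0 and pass the remaining columns through unchanged
ct = ColumnTransformer([("onehot", OneHotEncoder(), [0])], remainder="passthrough")
x = ct.fit_transform(x)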

Data Wrangling: Convert Categorical Variables into Dummies (Numbers: 0 and 1) in Python using Pandas Library

Categorical variables are those variables which contain categorical values. For example, consider the "Sex" column in our titanic dataset. It is a categorical variable containing the values male and female. We need to convert these categories (male and female) into numbers (0 and 1) because most machine learning algorithms don't accept string values.

I found 3 categorical columns (Sex, Embarked and Pclass) in the titanic dataset. So, lets convert them into numbers (0 and 1). In other words, lets one hot encode these variables.

Note: Before going through this article, please go through my previous article where I loaded this titanic dataset and removed the null values from it. This article is a continuation of that article on removing null values from the dataset.

The Pandas library in Python contains the get_dummies method, which does the one hot encoding of categorical variables (converts them into numbers - 0 and 1). The get_dummies method creates a new dataframe which consists of zeros and ones. 

Step 1: Convert categorical variables to their respective one hot encoded representation

sex = pd.get_dummies(dataset["Sex"], drop_first=True)
embark = pd.get_dummies(dataset["Embarked"], drop_first=True)
pclass = pd.get_dummies(dataset["Pclass"], drop_first=True)

If you want to keep all the newly created columns, don't set the drop_first parameter. If you want to view the data which got converted into zeros and ones, use the head() method.

sex.head()
embark.head()
pclass.head()

Step 2: Concatenate all the one hot encoded columns to the original dataset

dataset = pd.concat([dataset, sex, embark, pclass], axis=1)

Step 3: Drop original columns

As we have already one hot encoded the Sex, Embarked and Pclass columns, lets drop these columns from the original dataset.

dataset.drop(['Sex', 'Embarked', 'Pclass'], axis=1, inplace=True)

There are also some columns (like PassengerId, Name and Ticket) which are not going to contribute to any kind of prediction. Also, the Name and Ticket columns contain string values. So, it's better to remove them. 

dataset.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)

I will write a complete post on Label Encoding and One Hot Encoding in my upcoming articles. So, stay tuned.

Friday, 15 March 2019

Data Wrangling: Removing Null Values from Dataset in Python using Pandas Library

Removing null values from the dataset is one of the important steps in data wrangling. Null values adversely affect the performance and accuracy of most machine learning algorithms, so it is very important to handle them before applying any machine learning algorithm to the dataset. Although some algorithms have a built-in ability to handle null values, we should still deal with them manually while preparing the data.

We will use the Python pandas library to remove null values from the very famous Titanic dataset. Lets try it out.

Step 1: Import the required Python libraries (We only need pandas)

import pandas as pd

Step 2: Load and examine the dataset (Data Exploration)

dataset = pd.read_csv("C:\\datasets\\titanic.csv")
dataset.shape
dataset.info()
dataset.head()

You can download titanic dataset from Kaggle. There are 891 observations and 12 features in this dataset.

Step 3: Data Wrangling (Removing null values)

In this tutorial, we will just remove the null values from our titanic dataset as a part of the data wrangling step, in order to keep the article short and crisp.

Step 3.1: Lets see how many null values there are in our dataset:

dataset.isnull()

This will display the entire dataset in terms of True and False: non-null values are represented by False and null values are represented by True. I am not going to copy-paste the output here; please run it yourself and see the output. You will not achieve much just by reading the article, you need to practice alongside it.

Step 3.2: Lets display how many null values each column / feature contains:

dataset.isnull().sum()

From the output, it is clear that Age column contains 177 null values and Cabin column contains 687 null values.

Step 3.3: Lets drop the Cabin column.

We see that the Cabin column contains 687 null values out of 891 rows / observations, so it makes sense to drop this column from the dataset. Lets drop it.

dataset.drop("Cabin", axis=1, inplace=True)

Please note that the inplace parameter is used to make the change permanent in our dataframe. By default it is False. If we don't set it to True, drop() returns a new dataframe and the Cabin column is not dropped from our original dataset.

Step 3.4: Lets drop all rows in the dataset which contain null values.

dataset.dropna(inplace=True)

It will remove all the rows which contain null values from the dataset. Now our dataset does not contain any null values. This step is not generally recommended; I added it just for illustration, as we can lose significant information by executing it. There are methods to replace null values with some meaningful values, and we will explore those methods in my next post on data wrangling.

Thursday, 14 March 2019

Difference between Ridge Regression (L2 Regularization) and Lasso Regression (L1 Regularization)

Regularization is mainly used to solve the overfitting problem in Machine Learning and helps improve the generalization ability of ML models. 

If a model is too simple, it may not be able to capture the important patterns in the training data and it may underfit. Such a model will not be able to generalize well. 

A complex model, on the other hand, can also capture noise in the data which is totally irrelevant to our predictions. Such a model may perform well on the training data but will not perform well on the test data due to overfitting. 

We need to choose the right model, somewhere between the overly simple and the overly complex model. Regularization helps to control the model complexity so that the model does not overfit and generalizes better. 

Regularization is of 3 types:

1. Ridge Regression (L2 Regularization)
2. Lasso Regression (L1 Regularization)
3. Elastic Net Regression 

Regularization adds some amount of bias (called the Regularization Penalty) to the objective function and, in return, the algorithm gets a significant drop in variance. 

For example, Linear Regression tries to minimize a Loss Function (lets say the Sum of Squared Errors) to get the best fit line. In order to prevent this model from overfitting, we can add a Regularization Penalty to the Loss Function. Now the model has to minimize both the Loss Function and the Regularization Penalty. 

The severity of the penalty is usually chosen by cross validation. In this way, the final model is far less likely to overfit. The severity of the penalty can vary from 0 to positive infinity; if it is zero, it means we are not applying any regularization at all in our model.

Difference between Ridge Regression (L2 Regularization) and Lasso Regression (L1 Regularization)

1. In L1 regularization, we penalize the absolute value of the weights while in L2 regularization, we penalize the squared value of the weights.

2. L1 regularization can shrink some parameters exactly to zero, while L2 regularization can shrink the parameters very close to zero but not exactly to zero. So, L1 can effectively discard useless features in the dataset and make the model simpler.
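
In symbols (a rough sketch, writing $w_i$ for the model weights, SSE for the sum of squared errors and $\lambda$ for the severity of the penalty):

Ridge (L2): $\text{Loss} = \text{SSE} + \lambda \sum_i w_i^2$
Lasso (L1): $\text{Loss} = \text{SSE} + \lambda \sum_i |w_i|$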

When to use what?

There is no hard and fast rule. If you need to eliminate some useless features from the dataset, L1 should be preferred. But if you cannot afford to eliminate any feature from your dataset, use L2. In practice, we should try both L1 and L2 regularization and check which results in better generalization. We can also use Elastic Net Regression, which combines the penalties of both L1 and L2 regularization.
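
A minimal Scikit Learn sketch of both regressions (assuming X_train, y_train, X_test and y_test have already been prepared; alpha here is the severity of the penalty and would normally be tuned by cross validation):

from sklearn.linear_model import Ridge, Lasso

# L2 regularization: shrinks the weights towards zero but never exactly to zero
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print("Ridge R^2:", ridge.score(X_test, y_test))

# L1 regularization: can shrink some weights exactly to zero (automatic feature elimination)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print("Lasso R^2:", lasso.score(X_test, y_test))
print("Features kept by Lasso:", (lasso.coef_ != 0).sum())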

Tuesday, 12 March 2019

What is t-SNE? How does it work using t-Distribution?

t-SNE stands for t-Distributed Stochastic Neighbor Embedding. It is a non-linear dimensionality reduction algorithm. t-SNE uses a normal distribution (in the higher dimension) and a t-distribution (in the lower dimension) to reduce the dimensions of the dataset. We will see this in detail. 

As per documentation:

“t-Distributed stochastic neighbor embedding (t-SNE) minimizes the divergence between two distributions: a distribution that measures pairwise similarities of the input objects and a distribution that measures pairwise similarities of the corresponding low-dimensional points in the embedding”.

t-SNE is a technique to convert high-dimensional data into lower dimensional data while keeping the relative similarity of the data points as close to the original (in high dimensional space) as possible.

Lets see how t-SNE works. I will just give a high level understanding of the algorithm (because even I don't fully understand the complex mathematics behind it :)).

Higher Dimension

Step 1: t-SNE picks a data point (say P1), calculates its Euclidean distance from a neighboring data point (say P2) and converts this distance into a conditional probability using a normal distribution. 

Conditional probability represents the similarity between the pairs of data points.

Step 2: It then calculates the distance of P1 from another neighboring point (say P3) and does the same as in Step 1.

It keeps doing the same thing for all the data points. In this way, t-SNE computes pairwise conditional probabilities for each data point using a normal distribution in the higher dimension.

To summarize, t-SNE measures the similarity between every pair of data points. Similar data points get a higher similarity value and dissimilar data points get a lower one. It then converts that distance into a conditional probability according to the normal distribution and builds a similarity matrix (say S1).

Lower Dimension

Step 3: t-SNE arranges all of the data points randomly in the required lower dimensional space (say a two dimensional space).

Step 4: It does the same calculations for all the data points in the lower dimension as it did in the higher dimension in Step 1 and Step 2. The only difference is that it uses a t-distribution in this case instead of a normal distribution. That is why it is called t-SNE instead of plain SNE.

It also creates a similarity matrix in lower dimension (say S2).

Now t-SNE compares the similarity matrices S1 and S2 and tries to minimize the difference between them, such that every pair of points has a similar probability distribution in both the higher and the lower dimension. Gradient descent is used with the Kullback-Leibler divergence between the two distributions as the cost function: SNE minimizes the sum of Kullback-Leibler divergences over all data points using a gradient descent method.
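
Written out, the cost function that gradient descent minimizes is (roughly) the sum of Kullback-Leibler divergences between the high-dimensional conditional probabilities $p_{j|i}$ and the low-dimensional ones $q_{j|i}$:

$C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$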

Difference between normal distribution and t-Distribution

The t-distribution looks a lot like a normal distribution. The main difference is that the t-distribution is not as tall as the normal distribution in the middle, but its tails are heavier (fatter) at the ends. 

Why is the t-distribution used instead of the normal distribution in the lower dimension? Because without its heavier tails, the clusters would all clump up in the middle and would be harder to visualize.
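
A minimal sketch of running t-SNE with Scikit Learn (using the built-in digits dataset purely as an example; perplexity and learning_rate are the main hyperparameters you would normally tune):

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()

# Reduce the 64-dimensional digit images to 2 dimensions for visualization
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedding = tsne.fit_transform(digits.data)
print(embedding.shape)   # (1797, 2)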

Related:

Advantages and Disadvantages of t-SNE over PCA (PCA vs t-SNE)
Advantages and Disadvantages of Principal Component Analysis
Dimensionality Reduction: Feature Selection and Feature Extraction
Why is Dimensionality Reduction required in Machine Learning?

Advantages and Disadvantages of t-SNE over PCA (PCA vs t-SNE)

Both PCA (Principal Component Analysis) and t-SNE (t-Distributed Stochastic Neighbor Embedding) are dimensionality reduction techniques in Machine Learning and efficient tools for data exploration and visualization. In this article, we will compare PCA and t-SNE and see the advantages and disadvantages / limitations of t-SNE over PCA.

Advantages of t-SNE

1. Handles Non-Linear Data Efficiently: PCA is a linear algorithm. It creates Principal Components which are linear combinations of the existing features, so it is not able to capture complex non-linear relationships between features and performs poorly if the relationships between the variables are non-linear. t-SNE, on the other hand, works well on non-linear data. It is a very effective non-linear dimensionality reduction algorithm. 

PCA tries to place dissimilar data points far apart in the lower dimensional representation. But in order to represent high dimensional data lying on a low dimensional, non-linear manifold, it is important that similar data points are placed close together, which is not what PCA focuses on. t-SNE does this efficiently, so it can capture the structure of trickier manifolds in the dataset.

2. Preserves Local and Global Structure: t-SNE is capable of preserving the local and global structure of the data. This means, roughly, that points which are close to one another in the high-dimensional dataset will tend to be close to one another in the low dimension. PCA, on the other hand, finds new dimensions that explain most of the variance in the data, so unlike t-SNE it cares relatively little about local neighbors.

Disadvantages of t-SNE

1. Computationally Complex: t-SNE is computationally heavy because it computes pairwise conditional probabilities for every data point and tries to minimize the sum of the differences between the probabilities in the higher and lower dimensions.

“Since t-SNE scales quadratically in the number of objects N, its applicability is limited to data sets with only a few thousand input objects; beyond that, learning becomes too slow to be practical (and the memory requirements become too large)”.

t-SNE has quadratic time and space complexity in the number of data points. This makes it particularly slow and resource draining when applied to datasets comprising more than 10,000 observations. 

Use both PCA and t-SNE: The solution to the above problem is to use PCA and t-SNE in conjunction. So, if you have thousands of features in a dataset, don't use t-SNE for dimensionality reduction in the first step. First use PCA to reduce the dimensions to a reasonable number of features and then run t-SNE to further reduce the dimensionality (see the sketch after this list).

2. Non-deterministic: Different runs with the same hyperparameters may produce different results. So, you won't get exactly the same output each time you run it, though the results are likely to be similar.

3. Requires Hyperparameter Tuning: t-SNE has hyperparameters (such as perplexity and learning rate) that need to be tuned, unlike PCA which has practically no hyperparameters to tune. Handling these hyperparameters incorrectly may lead to misleading results.

4. Noisy Patterns: t-SNE may find apparent patterns in random noise as well, so multiple runs of the algorithm with different sets of hyperparameters should be checked before deciding whether a pattern really exists in the data.
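
A rough sketch of the PCA-then-t-SNE approach mentioned above (assuming X is a large, high-dimensional numeric feature matrix with more than 50 features):

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Step 1: use PCA to reduce the data to a manageable number of features (say 50)
X_pca = PCA(n_components=50).fit_transform(X)

# Step 2: run t-SNE on the PCA output to obtain the final 2D embedding
X_embedded = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_pca)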

Related:

What is t-SNE? How does it work using t-Distribution?
Advantages and Disadvantages of Principal Component Analysis
Dimensionality Reduction: Feature Selection and Feature Extraction
Why is Dimensionality Reduction required in Machine Learning?

Sunday, 10 March 2019

Implement XGBoost with K Fold Cross Validation in Python using Scikit Learn Library

In this post, we will implement XGBoost with the K Fold Cross Validation technique. We will use the cv() method provided by the xgboost library. You need to pass the nfold parameter to the cv() method, which represents the number of cross validation folds you want to run on your dataset. 

Before going through this implementation, I highly recommend you have a look at the basic implementation of XGBoost in my previous post.

Step 1: Import the required Python libraries like pandas and sklearn

import pandas as pd
from sklearn.datasets import load_boston
import xgboost as xgb

Step 2: Load and examine the dataset (Data Exploration)

dataset = load_boston()
dataset.keys()
dataset.data
dataset.target
dataset.data[0:5]
dataset.target[0:5]
dataset.data.shape
dataset.target.shape
dataset.feature_names
print(dataset.DESCR)

#convert the loaded scikit learn dataset into a pandas DataFrame

data = pd.DataFrame(dataset.data)
data.columns = dataset.feature_names
data.head()
data['PRICE'] = dataset.target
data.head()
data.info()
data.describe() 

Please note that "describe()" is used to display the statistical values of the data like mean and standard deviation.

I have not added details of the above exploratory methods. These are very simple methods and I am leaving it up to you to practice and explore yourself.

Step 3: Specify the features (X) and the label (y)

X, y = data.iloc[:,:-1], data.iloc[:,-1]

X contains the feature columns (attributes)
y contains the label column (PRICE)

Step 4: Convert dataset into DMatrix

Lets convert our dataset into an optimized data structure called DMatrix, which XGBoost supports and which delivers performance and efficiency gains. 

data_dmatrix = xgb.DMatrix(data=X, label=y)

Step 5: Create the model

Lets create a hyper-parameter dictionary called params which holds the hyper-parameters and their values as key-value pairs. We exclude n_estimators from this dictionary because we will use num_boost_round instead.

params = {"objective":"reg:linear",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50, early_stopping_rounds=10, metrics="rmse", as_pandas=True, seed=123)  

Explanation of above statement:

nfold = 3: we want to run 3 Fold Cross Validation.
metrics = rmse: we want to use root mean square error to check the accuracy.
num_boost_round = 50: number of trees you want to build (analogous to n_estimators)
early_stopping_rounds = 10: finishes training of the model early if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds.
as_pandas: returns the results in a pandas DataFrame. So, cv_results is a pandas DataFrame in the above statement.
seed: for reproducibility of results.

Step 6: Examine the results

cv_results.head()
print((cv_results["test-rmse-mean"]).tail(1))

I get an RMSE for the price prediction of around 4.03 (the prices are in units of $1000). You may reach an even lower RMSE with a different set of hyper-parameters; consider applying techniques like Grid Search, Random Search and Bayesian Optimization to find the optimal set of hyper-parameters.
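
As an illustration, a hedged sketch of a grid search over a couple of XGBoost hyper-parameters using Scikit Learn's GridSearchCV (assuming X and y from Step 3; the grid values shown here are purely illustrative):

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.3],
}

grid = GridSearchCV(
    estimator=XGBRegressor(n_estimators=50),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
print(grid.best_score_)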

Related:
Implement XGBoost in Python using Scikit Learn Library
Difference between GBM and XGBoost
Advantages of XGBoost Algorithm

Saturday, 9 March 2019

Advantages of XGBoost Algorithm in Machine Learning

XGBoost is an efficient and easy to use algorithm which delivers high performance and accuracy compared to many other algorithms. XGBoost is also known as a regularized version of GBM. Lets see some of the advantages of the XGBoost algorithm:

1. Regularization: XGBoost has in-built L1 (Lasso Regression) and L2 (Ridge Regression) regularization which prevents the model from overfitting. That is why, XGBoost is also called regularized form of GBM (Gradient Boosting Machine).

While using the Scikit Learn library, we pass two regularization-related hyper-parameters (alpha and lambda) to XGBoost: alpha is used for L1 regularization and lambda is used for L2 regularization.

2. Parallel Processing: XGBoost utilizes the power of parallel processing and that is why it is much faster than GBM. It uses multiple CPU cores to execute the model.

While using the Scikit Learn library, the nthread hyper-parameter is used for parallel processing. nthread represents the number of CPU cores to be used. If you want to use all the available cores, don't specify any value for nthread and the algorithm will detect it automatically.

3. Handling Missing Values: XGBoost has an in-built capability to handle missing values. When XGBoost encounters a missing value at a node, it tries sending it down both the left and the right split and learns, for each node, which direction gives the better result. It then uses the same default direction when working on the testing data.

4. Cross Validation: XGBoost allows the user to run cross-validation at each iteration of the boosting process, so it is easy to get the optimum number of boosting iterations in a single run. This is unlike GBM, where we have to run a grid search and only a limited number of values can be tested.

5. Effective Tree Pruning: A GBM stops splitting a node when it encounters a negative loss in the split; thus it is more of a greedy algorithm. XGBoost, on the other hand, makes splits up to the max_depth specified and then prunes the tree backwards, removing splits beyond which there is no positive gain.

For example, there may be a situation where a split with a negative gain of -4 is followed by a split with a positive gain of +13. GBM would stop as soon as it encounters -4, but XGBoost will go deeper, see the combined effect of +9 for the two splits, and keep both.

Related: Difference between GBM and XGBoost

Implement XGBoost in Python using Scikit Learn Library in Machine Learning

XGBoost is an implementation of the Gradient Boosting Machine. XGBoost is an optimized and regularized version of GBM. In this post, we will try to build a model using XGBRegressor to predict house prices using the Boston dataset. To know more about XGBoost and GBM, please consider visiting this post.

Step 1: Import the required Python libraries like pandas, numpy and sklearn

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error  

Step 2: Load and examine the dataset (Data Exploration)

dataset = load_boston()
dataset.keys()
dataset.data
dataset.target
dataset.data[0:5]
dataset.target[0:5]
dataset.data.shape
dataset.target.shape
dataset.feature_names
print(dataset.DESCR)

#convert the loaded scikit learn dataset into a pandas DataFrame

data = pd.DataFrame(dataset.data)
data.columns = dataset.feature_names
data.head()
data['PRICE'] = dataset.target
data.head()
data.info()
data.describe() 

Please note that "describe()" is used to display the statistical values of the data like mean and standard deviation.

I have not added details of the above exploratory methods. These are very simple methods and I am leaving it up to you to practice and explore yourself.

Step 3: Specify the features (X) and the label (y)

X, y = data.iloc[:,:-1], data.iloc[:,-1]

X contains the feature columns (attributes)
y contains the label column (PRICE)

Step 4: Split the dataset into training and testing dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=0) 

Step 5: Create and fit the model

model = XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1, max_depth = 5, alpha = 10, n_estimators = 10)

model.fit(X_train, y_train)  

I will write a separate post explaining the above hyperparameters of the XGBoost algorithm.

Step 6: Predict from the model

y_pred = model.predict(X_test)

y_pred is a numpy array that contains the predicted values for the inputs in X_test.

Lets see the difference between the actual and predicted values.

df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})  
df 

Step 7: Check the accuracy

meanAbsoluteError = mean_absolute_error(y_test, y_pred)
meanSquaredError = mean_squared_error(y_test, y_pred)
rootMeanSquaredError = np.sqrt(meanSquaredError)
print('Mean Absolute Error:', meanAbsoluteError)  
print('Mean Squared Error:', meanSquaredError)  
print('Root Mean Squared Error:', rootMeanSquaredError)

Friday, 8 March 2019

Implement AdaBoost in Python using Scikit Learn Library in Machine Learning

We are going to implement the AdaBoost algorithm in Python using the Scikit Learn library. AdaBoost is an ensemble learning technique, and we will use AdaBoostClassifier to solve the IRIS dataset classification problem. 

Step 1: Import the required Python libraries

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

Step 2: Load and examine the dataset

dataset = datasets.load_iris()
dataset.feature_names
dataset.target_names
dataset.data.shape
dataset.target.shape
dataset.data[0:5]
dataset.target[0:5]

Step 3: Specify the features (X) and the labels (y)

X = dataset.data
y = dataset.target

Step 4: Split the dataset into training and testing dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Step 5: Create and fit the model

model = AdaBoostClassifier(n_estimators=50, learning_rate=1)
model.fit(X_train, y_train)

Step 6: Predict from the model

y_pred = model.predict(X_test)

Step 7: Check the accuracy

confusionMatrix = confusion_matrix(y_test, y_pred)
accuracyScore = accuracy_score(y_test, y_pred)
classificationReport = classification_report(y_test, y_pred)
print(confusionMatrix)
print(accuracyScore * 100)
print(classificationReport)

To learn more about AdaBoost, you can refer to my posts below:

Difference between Random Forest and AdaBoost in Machine Learning
Difference between AdaBoost and Gradient Boosting Machine (GBM)