Sunday, 10 March 2019

Implement XGBoost with K Fold Cross Validation in Python using Scikit Learn Library

In this post, we will implement XGBoost with the K Fold Cross Validation technique using the Scikit Learn library. We will use the cv() method, which is provided by the XGBoost library itself and works well alongside Scikit Learn. You need to pass the nfold parameter to the cv() method, which represents the number of folds you want to use for cross validation on your dataset.

Before going through this implementation, I highly recommend having a look at the plain implementation of XGBoost in my previous post.

Step 1: Import the required Python libraries like pandas and sklearn

import pandas as pd
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2+
import xgboost as xgb

Step 2: Load and examine the dataset (Data Exploration)

dataset = load_boston()
dataset.keys()
dataset.data
dataset.target
dataset.data[0:5]
dataset.target[0:5]
dataset.data.shape
dataset.target.shape
dataset.feature_names
print(dataset.DESCR)

#convert the loaded dataset from scikit learn library to pandas library

data = pd.DataFrame(dataset.data)
data.columns = dataset.feature_names
data.head()
data['PRICE'] = dataset.target
data.head()
data.info()
data.describe() 

Please note that "describe()" is used to display statistical summaries of the data, such as the mean and standard deviation.

I have not added details of the above exploratory methods. They are very simple, so I am leaving it up to you to practice and explore them yourself.

Step 3: Separate the features (X) and the target (y)

X, y = data.iloc[:,:-1], data.iloc[:,-1]

X contains the attributes (all columns except the last)
y contains the labels (the PRICE column)

Step 4: Convert dataset into DMatrix

Let's convert our dataset into DMatrix, an optimized data structure that XGBoost supports and that delivers performance and efficiency gains.

data_dmatrix = xgb.DMatrix(data=X, label=y)

Step 5: Create the model

Let's create a hyper-parameter dictionary params which holds all the hyper-parameters and their values as key-value pairs. We exclude n_estimators from this dictionary because xgb.cv() uses num_boost_round instead.

params = {"objective": "reg:squarederror",  # named "reg:linear" in older XGBoost versions
          "colsample_bytree": 0.3, "learning_rate": 0.1,
          "max_depth": 5, "alpha": 10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50, early_stopping_rounds=10, metrics="rmse", as_pandas=True, seed=123)  

Explanation of above statement:

nfold = 3: we want to run 3-Fold Cross Validation.
metrics = "rmse": we want to use root mean square error as the evaluation metric.
num_boost_round = 50: the number of trees you want to build (analogous to n_estimators).
early_stopping_rounds = 10: stops training early if the hold-out metric ("rmse" in our case) does not improve for the given number of rounds.
as_pandas: returns the results in a pandas DataFrame. So, cv_results is a pandas DataFrame in the above statement.
seed: for reproducibility of results.

Step 6: Examine the results

cv_results.head()
print((cv_results["test-rmse-mean"]).tail(1))

I get an RMSE for the price prediction of around 4.03, i.e. about $4,030, since the prices are in units of $1000. You can reach an even lower RMSE with a different set of hyper-parameters. You may consider applying techniques like Grid Search, Random Search and Bayesian Optimization to find an optimal set of hyper-parameters.

Related:
Implement XGBoost in Python using Scikit Learn Library
Difference between GBM and XGBoost
Advantages of XGBoost Algorithm
