In this post, we will implement XGBoost with the K-Fold Cross Validation technique using the Scikit Learn library.

Before going through this implementation, I highly recommend having a look at the standard implementation of XGBoost in my previous post.

You can download my Jupyter notebook implementing XGBoost using Cross Validation from here.


We will use the **cv()** method, which comes from the **xgboost** library. You need to pass the **nfold** parameter to the **cv()** method; it represents the number of cross-validation folds you want to run on your dataset.
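As a side illustration (not from the original post), the fold splitting that **nfold** controls can be sketched with scikit-learn's **KFold**: with 10 samples and 5 folds, each round trains on 8 samples and validates on the remaining 2.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features

# 5 folds: each round holds out 2 samples for validation
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(tr), len(te)) for tr, te in kf.split(X)]
print(fold_sizes)  # each tuple is (train size, validation size)
```

With **nfold=10** in xgboost's **cv()**, the same idea applies, just with 10 folds instead of 5.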


**Step 1: Import the required Python libraries like pandas and sklearn**

import pandas as pd

from sklearn.datasets import load_boston

import xgboost as xgb

**Step 2: Load and examine the dataset (Data Exploration)**

dataset = load_boston()

dataset.keys()

dataset.data

dataset.target

dataset.data[0:5]

dataset.target[0:5]

dataset.data.shape

dataset.target.shape

dataset.feature_names

print(dataset.DESCR)

# convert the loaded dataset from the scikit-learn format to a pandas DataFrame
data = pd.DataFrame(dataset.data)

data.columns = dataset.feature_names

data.head()

data['PRICE'] = dataset.target

data.head()

data.info()

data.describe()

*Please note that describe() is used to display the summary statistics of the data, like the mean and standard deviation.*
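For example (a toy frame, not the Boston data), **describe()** returns a frame of summary statistics indexed by 'count', 'mean', 'std', and so on:

```python
import pandas as pd

df = pd.DataFrame({'PRICE': [24.0, 21.6, 34.7, 33.4]})
stats = df.describe()
print(stats.loc['mean', 'PRICE'])   # 28.425
print(stats.loc['count', 'PRICE'])  # 4.0
```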

**Step 3: Separate the attributes (X) and labels (y)**

X, y = data.iloc[:,:-1], data.iloc[:,-1]
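On a toy frame (illustrative, not the Boston data), the iloc slicing works like this: every column except the last becomes a feature, and the last column becomes the label.

```python
import pandas as pd

toy = pd.DataFrame({'CRIM': [0.1, 0.2], 'RM': [6.5, 7.1], 'PRICE': [24.0, 21.6]})
X, y = toy.iloc[:, :-1], toy.iloc[:, -1]  # all but last column vs. last column
print(list(X.columns))  # ['CRIM', 'RM']
print(y.name)           # 'PRICE'
```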

*X contains the list of attributes; y contains the list of labels.*

**Step 4: Convert the dataset into a DMatrix**

*Let's convert our dataset into an optimized data structure called DMatrix, which XGBoost supports and which delivers performance and efficiency gains.*

data_dmatrix = xgb.DMatrix(data=X, label=y)

**Step 5: Create the model**

Let's create a hyper-parameter dictionary **params** which holds all the hyper-parameters and their values as key-value pairs. We will exclude **n_estimators** from the hyper-parameter dictionary because we will use **num_boost_round** instead.

params = {'objective':'reg:linear', 'colsample_bytree':0.3, 'learning_rate':0.1,
          'max_depth':5, 'alpha':10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=10,
                    num_boost_round=50, early_stopping_rounds=10,
                    metrics='rmse', as_pandas=True, seed=0)

Explanation of the above statement:

**nfold = 10:** we want to run 10-Fold Cross Validation.

**metrics = 'rmse':** we want to use root mean square error to check the accuracy.

**num_boost_round = 50:** the number of trees you want to build (analogous to **n_estimators**).

**early_stopping_rounds = 10:** finishes training of the model early if the hold-out metric ('rmse' in our case) does not improve for the given number of rounds.

**as_pandas:** returns the results in a pandas data frame, so **cv_results** in the above statement is a pandas data frame.

**seed:** for reproducibility of results.

**Step 6: Examine the results**

cv_results.head()

print((cv_results["test-rmse-mean"]).tail())
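For intuition (an illustration, not part of the original post), the 'rmse' reported in **test-rmse-mean** is just the square root of the mean squared error between the predicted and true prices:

```python
import numpy as np

y_true = np.array([24.0, 21.6, 34.7])
y_pred = np.array([26.0, 20.6, 30.7])

# errors: -2, 1, 4 -> squared: 4, 1, 16 -> mean: 7 -> sqrt: ~2.646
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(round(rmse, 3))  # 2.646
```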

I get an RMSE of around 3.67 for the price prediction (prices are in units of $1000). You can reach an even lower RMSE with a different set of hyper-parameters. You may consider applying techniques like Grid Search, Random Search, and Bayesian Optimization to reach the optimal set of hyper-parameters.
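As a hedged sketch of the Grid Search idea, scikit-learn's **GridSearchCV** can try every combination in a parameter grid with cross validation. The synthetic data and the choice of scikit-learn's own GradientBoostingRegressor are my assumptions here (so the snippet runs even without xgboost installed); with xgboost available you would pass its scikit-learn wrapper, **XGBRegressor**, instead.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# synthetic regression data (stand-in for the housing dataset)
rng = np.random.RandomState(0)
X = rng.rand(80, 3)
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=80)

# 2 x 2 grid -> 4 hyper-parameter combinations, each scored with 5-fold CV
param_grid = {'learning_rate': [0.05, 0.1], 'max_depth': [2, 3]}
search = GridSearchCV(GradientBoostingRegressor(n_estimators=50, random_state=0),
                      param_grid, cv=5, scoring='neg_root_mean_squared_error')
search.fit(X, y)
print(search.best_params_)  # the combination with the lowest CV RMSE
```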

**Related:**

Implement XGBoost in Python using Scikit Learn Library

Difference between GBM and XGBoost

Advantages of XGBoost Algorithm
