In this post, we will implement XGBoost with the K-Fold Cross Validation technique using the Scikit-Learn library. We will use the **cv()** method, which is present in the **xgboost** library. You need to pass the **nfold** parameter to **cv()**; it represents the number of cross-validation folds you want to run on your dataset.

Before going through this implementation, I highly recommend you have a look at the normal implementation of XGBoost in my previous post.

**Step 1: Import the required Python libraries like pandas and sklearn**

import pandas as pd

from sklearn.datasets import load_boston

import xgboost as xgb

**Step 2: Load and examine the dataset (Data Exploration)**

dataset = load_boston()

*Note: load_boston was removed in scikit-learn 1.2. To run this code as written, use an older scikit-learn release, or substitute another regression dataset such as fetch_california_housing.*

dataset.keys()

dataset.data

dataset.target

dataset.data[0:5]

dataset.target[0:5]

dataset.data.shape

dataset.target.shape

dataset.feature_names

print(dataset.DESCR)

#convert the loaded dataset from the scikit-learn library to a pandas DataFrame

data = pd.DataFrame(dataset.data)

data.columns = dataset.feature_names

data.head()

data['PRICE'] = dataset.target

data.head()

data.info()

data.describe()

*Please note that "describe()" is used to display summary statistics of the data, such as the mean and standard deviation.*
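For instance, on a tiny DataFrame of our own, describe() reports the count, mean, standard deviation, min, quartiles, and max of each numeric column (the column name and values below are purely illustrative):

```python
import pandas as pd

# a tiny illustrative DataFrame; "RM" mimics one Boston feature name
df = pd.DataFrame({"RM": [6.5, 7.2, 5.9, 6.1]})

stats = df.describe()

print(stats.loc["count", "RM"])  # 4.0
print(stats.loc["mean", "RM"])   # 6.425
```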

*I have not added details of the above exploratory methods. They are very simple, and I am leaving it up to you to practice and explore them yourself.*

**Step 3: Separate the features (X) and the labels (y)**

X, y = data.iloc[:,:-1], data.iloc[:,-1]

*X contains the feature columns and y contains the labels.*

**Step 4: Convert the dataset into a DMatrix**

*Let's convert our dataset into DMatrix, an optimized data structure that XGBoost supports internally and that delivers performance and efficiency gains.*

data_dmatrix = xgb.DMatrix(data=X, label=y)

**Step 5: Create the model**

Let's create a hyper-parameter dictionary **params** which holds all the hyper-parameters and their values as key-value pairs. We exclude **n_estimators** from this dictionary because we will use **num_boost_round** instead.

params = {"objective":"reg:linear", 'colsample_bytree': 0.3, 'learning_rate': 0.1, 'max_depth': 5, 'alpha': 10}

*Note: in newer XGBoost versions the "reg:linear" objective has been renamed to "reg:squarederror".*

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3, num_boost_round=50, early_stopping_rounds=10, metrics="rmse", as_pandas=True, seed=123)

Explanation of the above statement:

- **nfold = 3:** we want to run 3-fold cross validation.
- **metrics = "rmse":** we want to use root mean square error to measure accuracy.
- **num_boost_round = 50:** the number of trees to build (analogous to **n_estimators**).
- **early_stopping_rounds = 10:** finishes training early if the hold-out metric ("rmse" in our case) does not improve for the given number of rounds.
- **as_pandas = True:** returns the results as a pandas DataFrame, so **cv_results** is a DataFrame in the above statement.
- **seed = 123:** for reproducibility of results.

**Step 6: Examine the results**

cv_results.head()

print((cv_results["test-rmse-mean"]).tail(1))

I get an RMSE of around 4.03 for the price prediction (prices are in units of $1,000). You can reach an even lower RMSE with a different set of hyper-parameters. You may consider applying techniques like Grid Search, Random Search, and Bayesian Optimization to reach an optimal set of hyper-parameters.

**Related:**

Implement XGBoost in Python using Scikit Learn Library

Difference between GBM and XGBoost

Advantages of XGBoost Algorithm
