Saturday, 9 March 2019

Implement XGBoost in Python using Scikit Learn Library in Machine Learning

XGBoost is an optimized and regularized implementation of the Gradient Boosting Machine (GBM). In this post, we will build a model using XGBRegressor to predict house prices from the Boston housing dataset. To learn more about XGBoost and GBM, please consider visiting this post.

You can download my Jupyter notebook implementing XGBoost from here.

Step 1: Import the required Python libraries: pandas, numpy, sklearn and xgboost

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error  

Step 2: Load and examine the dataset (Data Exploration)

dataset = load_boston()
dataset.keys()
dataset.data
dataset.target
dataset.data[0:5]
dataset.target[0:5]
dataset.data.shape
dataset.target.shape
dataset.feature_names
print(dataset.DESCR)

# Convert the dataset loaded from scikit-learn into a pandas DataFrame

data = pd.DataFrame(dataset.data)
data.columns = dataset.feature_names
data.head()
data['PRICE'] = dataset.target
data.head()
data.info()
data.describe() 

Please note that describe() displays summary statistics for each column, such as the count, mean, standard deviation, minimum, maximum and quartiles.
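As a small illustration of what describe() returns, here is a minimal sketch on a made-up three-row column (the name toy and its values are assumptions for this example, not part of the Boston data):

```python
import pandas as pd

# Hypothetical toy column; the real dataset has many more rows.
toy = pd.DataFrame({"PRICE": [10.0, 20.0, 30.0]})

# describe() returns a DataFrame indexed by statistic name.
summary = toy.describe()
print(summary)

# Individual statistics can be looked up by label.
print("mean:", summary.loc["mean", "PRICE"])
print("std:", summary.loc["std", "PRICE"])  # sample standard deviation (ddof=1)
```

Note that the standard deviation reported by pandas is the sample standard deviation, so it divides by n - 1 rather than n.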

Step 3: Define the features (X) and the target (y)

X, y = data.iloc[:,:-1], data.iloc[:,-1]

X contains the feature columns (every column except the last one).
y contains the target values (the last column, PRICE).
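To see how the iloc slicing separates features from the target, here is a minimal sketch on a tiny made-up DataFrame (the rows and the name toy are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy frame with two feature columns and the target last.
toy = pd.DataFrame({
    "RM": [6.5, 6.4, 7.1],
    "AGE": [65.2, 78.9, 61.1],
    "PRICE": [24.0, 21.6, 34.7],
})

# All rows, every column except the last -> features (DataFrame).
# All rows, only the last column -> target (Series).
X, y = toy.iloc[:, :-1], toy.iloc[:, -1]

print(list(X.columns))
print(y.name)
```

This only works as intended because PRICE was appended as the last column; if the target sat elsewhere, selecting it by name would be the safer choice.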

Step 4: Split the dataset into training and testing dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=0) 
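The effect of test_size and random_state can be checked on a small synthetic array. This is only a sketch; the names X_toy, y_toy and the 10-row size are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: row i of X_toy is [2*i, 2*i + 1], and its label is i.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.20 holds out 20% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0
)
print(X_tr.shape, X_te.shape)

# Using the same random_state reproduces exactly the same split.
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0
)
print(np.array_equal(X_te, X_te_again))
```

Fixing random_state makes the results repeatable between runs, which helps when comparing models on the same split.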

Step 5: Create and fit the model

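# 'reg:squarederror' replaces the deprecated 'reg:linear' objective; reg_alpha is the L1 regularization term
model = XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1, max_depth=5, reg_alpha=10, n_estimators=10)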

model.fit(X_train, y_train)  

I will write a separate post explaining the above hyperparameters of the XGBoost algorithm.

Step 6: Predict from the model

y_pred = model.predict(X_test)

y_pred is a numpy array that contains one predicted value for each row in X_test.

Let's see the difference between the actual and predicted values.

df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})  
df 
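To make the difference explicit, an error column can be added to such a table. The sketch below uses made-up actual and predicted values (y_test_toy, y_pred_toy and the index labels are assumptions, not output from the model above):

```python
import numpy as np
import pandas as pd

# Hypothetical values; y_test keeps its original row labels after the split,
# while predictions come back as a plain numpy array.
y_test_toy = pd.Series([24.0, 21.6, 34.7], index=[329, 371, 219])
y_pred_toy = np.array([22.5, 23.0, 30.1])

# The array is matched to the Series by position, and the Series index is kept.
df_toy = pd.DataFrame({"Actual": y_test_toy, "Predicted": y_pred_toy})

# Positive error: the model predicted too low; negative: too high.
df_toy["Error"] = df_toy["Actual"] - df_toy["Predicted"]
print(df_toy)
```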

Step 7: Check the accuracy (error metrics)

meanAbsoluteError = mean_absolute_error(y_test, y_pred)
meanSquaredError = mean_squared_error(y_test, y_pred)
rootMeanSquaredError = np.sqrt(meanSquaredError)
print('Mean Absolute Error:', meanAbsoluteError)  
print('Mean Squared Error:', meanSquaredError)  
print('Root Mean Squared Error:', rootMeanSquaredError)
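For reference, the three metrics can also be computed directly from their definitions. This is a minimal sketch with made-up numbers (the arrays actual and predicted are assumptions for illustration):

```python
import numpy as np

# Hypothetical actual and predicted values.
actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

errors = actual - predicted

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # square root of MSE, in the target's units

print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)
print('Root Mean Squared Error:', rmse)
```

Lower values are better for all three. Because MSE squares each error, it penalizes large misses more heavily than MAE does.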
