XGBoost is an optimized, regularized implementation of the Gradient Boosting Machine (GBM). In this post, we will build a model with XGBRegressor to predict house prices using the Boston housing dataset. To learn more about XGBoost and GBM, please see this post.
You can download my Jupyter notebook implementing XGBoost from here.
Step 1: Import the required Python libraries like pandas, numpy and sklearn
import pandas as pd
import numpy as np
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
Step 2: Load and examine the dataset (Data Exploration)
dataset = load_boston()
dataset.keys()
dataset.data
dataset.target
dataset.data[0:5]
dataset.target[0:5]
dataset.data.shape
dataset.target.shape
dataset.feature_names
print(dataset.DESCR)
# Convert the loaded scikit-learn dataset to a pandas DataFrame
data = pd.DataFrame(dataset.data)
data.columns = dataset.feature_names
data.head()
data['PRICE'] = dataset.target
data.head()
data.info()
data.describe()
Note that describe() displays summary statistics for each column, such as the count, mean and standard deviation.
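To make those summary statistics concrete, here is a small sketch that computes a few of them by hand with Python's standard statistics module. The prices list is made-up illustrative data, not values taken from the Boston dataset. The standard deviation shown is the sample standard deviation (dividing by n - 1), which is what pandas reports by default.

```python
import statistics

# Hypothetical values, for illustration only (not taken from the Boston dataset).
prices = [24.0, 21.6, 34.7, 33.4, 36.2]

summary = {
    "count": len(prices),
    "mean": statistics.mean(prices),
    "std": statistics.stdev(prices),  # sample standard deviation (n - 1)
    "min": min(prices),
    "max": max(prices),
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```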
Step 3: Separate the features (X) and the target (y)
X, y = data.iloc[:,:-1], data.iloc[:,-1]
X contains the feature columns (every column except the last).
y contains the target values (the last column, PRICE).
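If the positional slicing is unfamiliar, the following sketch may help. It applies the same two iloc expressions to a tiny, made-up DataFrame (the column values are hypothetical, and toy, X_toy and y_toy are names chosen just for this example).

```python
import pandas as pd

# Hypothetical three-row frame; the last column plays the role of PRICE.
toy = pd.DataFrame({
    "RM": [6.5, 6.4, 7.1],
    "AGE": [65.2, 78.9, 61.1],
    "PRICE": [24.0, 21.6, 34.7],
})

X_toy = toy.iloc[:, :-1]  # all rows, every column except the last
y_toy = toy.iloc[:, -1]   # all rows, last column only (a Series)

print(list(X_toy.columns))
print(y_toy.name)
```

Because the split is positional, it relies on the target being the last column, which is why PRICE was appended after the feature columns.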
Step 4: Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=0)
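As a rough illustration of what a seeded 80/20 hold-out split does, here is a minimal plain-Python sketch. The helper holdout_split is hypothetical and is not how scikit-learn implements train_test_split, so the rows it selects will differ. It only shows the idea: shuffle the row indices with a fixed seed, then set aside a fraction of them for testing.

```python
import math
import random


def holdout_split(rows, test_size=0.20, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # seeded, so the split is repeatable
    n_test = math.ceil(len(rows) * test_size)   # round the test count up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(506))  # stand-in for the 506 rows of the Boston dataset
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as random_state=0 in the call above: rerunning the code gives the same split, so the results are reproducible.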
Step 5: Create and fit the model
model = XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1, max_depth=5, alpha=10, n_estimators=10)
model.fit(X_train, y_train)
I will write a separate post explaining the XGBoost hyperparameters used above.
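In the meantime, a simplified sketch may give some intuition for two of them, n_estimators and learning_rate. The code below is plain gradient boosting for squared error with one-split trees, written from scratch. It is not XGBoost's actual algorithm, which adds regularization, column subsampling and other refinements. The function names and the toy data are assumptions made for this illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators=10, learning_rate=0.1):
    """Plain gradient boosting for squared error with one-split trees."""
    base = sum(ys) / len(ys)  # start from the mean of the targets
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Each tree's contribution is shrunk by the learning rate.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of the model on the given points."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

few_rounds = boost(xs, ys, n_estimators=10, learning_rate=0.1)
more_rounds = boost(xs, ys, n_estimators=100, learning_rate=0.1)
print("Training MSE after 10 rounds: ", mse(few_rounds, xs, ys))
print("Training MSE after 100 rounds:", mse(more_rounds, xs, ys))
```

With a small learning rate, each tree corrects only a fraction of the remaining error, so more trees are needed to fit the training data. A lower training error does not guarantee a lower test error, which is why the model is evaluated on held-out data.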
Step 6: Predict from the model
y_pred = model.predict(X_test)
y_pred is a numpy array containing one predicted value for each row of X_test.
Let's compare the actual and predicted values side by side.
df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
df
Step 7: Evaluate the model's error
meanAbsoluteError = mean_absolute_error(y_test, y_pred)
meanSquaredError = mean_squared_error(y_test, y_pred)
rootMeanSquaredError = np.sqrt(meanSquaredError)
print('Mean Absolute Error:', meanAbsoluteError)
print('Mean Squared Error:', meanSquaredError)
print('Root Mean Squared Error:', rootMeanSquaredError)
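For reference, the three metrics are simple to compute by hand. The sketch below uses made-up actual and predicted values (not output from the model above) to show the arithmetic behind each one.

```python
import math

# Hypothetical actual and predicted values, for illustration only.
actual = [24.0, 21.6, 34.7, 33.4]
predicted = [22.0, 23.6, 30.7, 35.4]

errors = [a - p for a, p in zip(actual, predicted)]
mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)  # mean squared error
rmse = math.sqrt(mse)                            # root mean squared error

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

MAE and RMSE are both in the same units as the target, but RMSE penalizes large errors more heavily because the errors are squared before averaging.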