Random Forest is a bagging algorithm based on the ensemble learning technique: it trains many decision trees on bootstrap samples of the data and combines their outputs. It can be used for both classification and regression problems.
In the last article, we solved a classification problem using Random Forest. In this article, we will solve a regression problem (predicting petrol consumption in the US) using Random Forest. To do so, we import RandomForestRegressor instead of RandomForestClassifier from the sklearn library.
To measure the performance of a regression model, we import the mean_absolute_error and mean_squared_error metrics instead of confusion_matrix, accuracy_score and classification_report, which we used for the classification problem.
You can download petrol_consumption.csv from here. You can also download my Jupyter notebook containing the Random Forest implementation shown below.
Step 1: Import the required Python libraries like pandas, numpy and sklearn
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
Step 2: Load and examine the dataset
names = ['Index', 'One', 'Petrol tax (cents per gallon)', 'Average income (dollars)',
'Paved Highways (miles)', 'Proportion of population with driver licenses',
'Consumption of petrol (millions of gallons)']
dataset = pd.read_csv('petrol_consumption.csv', names=names)
dataset.shape
dataset.head()
dataset.describe()
Note that "describe()" displays summary statistics of the data, such as the mean and standard deviation of each numeric column.
Step 3: Separate the features (X) and the label (y)
X = dataset.iloc[:, 2:6].values
y = dataset.iloc[:, 6].values
Note that the first two columns, "Index" and "One", are of no use for making predictions, so we exclude them from the features. The last column is the label, so it is kept out of X and stored in y instead.
X contains the attributes (features)
y contains the labels
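To make the slicing in this step concrete, here is a minimal sketch on a tiny made-up frame. The short column names and all values are placeholders invented for illustration, not rows from petrol_consumption.csv; only the 7-column layout mirrors the article.

```python
import pandas as pd

# Hypothetical 3-row frame with the same 7-column layout as the article's dataset.
# All values are made up purely to show which columns the slices pick.
toy = pd.DataFrame(
    [[1, 1, 7.0, 4000, 2000, 0.50, 500],
     [2, 1, 8.0, 4100, 2100, 0.55, 520],
     [3, 1, 9.0, 4200, 2200, 0.60, 560]],
    columns=['Index', 'One', 'tax', 'income', 'highways', 'licenses', 'consumption'],
)

# iloc[:, 2:6] keeps column positions 2, 3, 4 and 5 (the end position 6 is excluded).
X_toy = toy.iloc[:, 2:6].values
# iloc[:, 6] keeps only the last column, which is the label.
y_toy = toy.iloc[:, 6].values

print(list(toy.columns[2:6]))  # the four feature columns
print(X_toy.shape, y_toy.shape)
```

Because the end of an iloc slice is exclusive, 2:6 stops just before the label column, so the label never leaks into X.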
Step 4: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=0)
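As a quick illustration of what test_size and random_state do, the sketch below splits a made-up array of 50 rows (the row count is an assumption chosen for round numbers, not the size of the petrol dataset). With test_size=0.20, 10 rows go to the test set and 40 to the training set, and repeating the call with the same random_state gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 50 rows, 4 feature columns, one label per row.
X_demo = np.arange(200).reshape(50, 4)
y_demo = np.arange(50)

X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)
print(X_a.shape, X_b.shape)  # (40, 4) (10, 4)

# The same random_state reproduces exactly the same split.
X_a2, X_b2, y_a2, y_b2 = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)
print(np.array_equal(X_b, X_b2))  # True
```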
Step 5: Scale the features
standardScaler = StandardScaler()
X_train = standardScaler.fit_transform(X_train)
X_test = standardScaler.transform(X_test)
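This step is not required for Random Forest. The algorithm does not scale features internally; rather, each tree splits on one feature at a time by comparing it against a threshold, so rescaling a feature does not change which splits are chosen. In general, feature scaling is not needed for tree-based algorithms. It is kept here only so that the same pipeline also works with scale-sensitive models.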
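The point that tree-based models do not depend on feature scaling can be checked empirically. The sketch below uses synthetic data (the features, the target formula and the variable names are all assumptions made for illustration) and trains one forest on raw features and one on standardized features with the same random_state. The two sets of predictions are expected to agree up to floating-point effects.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for this check only.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 1000, size=(60, 4))
y_syn = 0.5 * X_syn[:, 0] - 0.2 * X_syn[:, 1] + rng.normal(0, 10, size=60)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.20, random_state=0)

# Forest trained on the raw features.
raw_model = RandomForestRegressor(n_estimators=120, random_state=0)
raw_model.fit(X_tr, y_tr)
pred_raw = raw_model.predict(X_te)

# Forest trained on standardized features (scaler fitted on the training set only).
scaler = StandardScaler()
scaled_model = RandomForestRegressor(n_estimators=120, random_state=0)
scaled_model.fit(scaler.fit_transform(X_tr), y_tr)
pred_scaled = scaled_model.predict(scaler.transform(X_te))

# Largest disagreement between the two sets of predictions.
max_diff = np.max(np.abs(pred_raw - pred_scaled))
print('Largest difference between predictions:', max_diff)
```

Note that the scaler is fitted on the training set and only applied to the test set, which matches the fit_transform / transform pattern used in this step.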
Step 6: Create and fit the model
model = RandomForestRegressor(n_estimators=120, random_state=0)
model.fit(X_train, y_train)
"n_estimators" is the number of trees to build in the Random Forest. In recent versions of scikit-learn (0.22 and later), the default is 100.
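To get a feel for n_estimators, one option is to fit forests of different sizes and compare their test error. The sketch below does this on synthetic data; the data, the tree counts tried and the variable names are assumptions for illustration, and the numbers it prints say nothing about the petrol dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data, invented for this comparison only.
rng = np.random.default_rng(1)
X_syn = rng.uniform(0, 10, size=(200, 4))
y_syn = 3.0 * X_syn[:, 0] + np.sin(X_syn[:, 1]) + rng.normal(0, 0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.20, random_state=0)

mae_by_trees = {}
for n_trees in (1, 10, 120):
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    forest.fit(X_tr, y_tr)
    # After fitting, forest.estimators_ holds one decision tree per estimator.
    mae_by_trees[n_trees] = mean_absolute_error(y_te, forest.predict(X_te))

for n_trees, mae in mae_by_trees.items():
    print(n_trees, 'trees -> test MAE:', mae)
```

More trees usually make the predictions more stable at the cost of longer training time, so it is worth checking where the error stops improving on your own data.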
Step 7: Predict from the model
y_pred = model.predict(X_test)
y_pred is a numpy array that contains the predicted value for each row of X_test.
Let's compare the actual and predicted values side by side.
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df
Step 8: Check the prediction error
meanAbsoluteError = mean_absolute_error(y_test, y_pred)
meanSquaredError = mean_squared_error(y_test, y_pred)
rootMeanSquaredError = np.sqrt(meanSquaredError)
print('Mean Absolute Error:', meanAbsoluteError)
print('Mean Squared Error:', meanSquaredError)
print('Root Mean Squared Error:', rootMeanSquaredError)
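To see what these three metrics actually compute, here is a small worked example on three made-up value pairs (the numbers are invented for illustration). The errors are 1, 0 and -2, so the mean absolute error is (1 + 0 + 2) / 3 = 1.0, the mean squared error is (1 + 0 + 4) / 3 ≈ 1.667, and the root mean squared error is its square root, about 1.291.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted values.
actual = np.array([3.0, 5.0, 2.0])
predicted = np.array([2.0, 5.0, 4.0])

errors = actual - predicted            # [ 1.,  0., -2.]
mae_manual = np.mean(np.abs(errors))   # average size of the errors
mse_manual = np.mean(errors ** 2)      # average squared error, penalizes large misses more
rmse_manual = np.sqrt(mse_manual)      # back in the same units as the target

print(mae_manual, mse_manual, rmse_manual)

# The sklearn functions give the same numbers.
print(mean_absolute_error(actual, predicted))
print(mean_squared_error(actual, predicted))
```

All three are error measures, so lower is better. RMSE and MAE are in the same units as the label, which makes them easier to interpret than MSE.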