Pages

Saturday, 23 February 2019

Implement Random Forest Algorithm in Python using Scikit Learn Library for Classification Problem

Random Forest is a bagging algorithm based on Ensemble Learning technique. The random forest algorithm can be used for both classification and regression problems. 

In this article, we will solve a classification problem (bank note authentication) using Random Forest. We need to import RandomForestClassifier from sklearn library to implement Random Forest.

You can download bank_note_authentication.csv from here. You can also download my Jupyter notebook containing below code of Random Forest implementation.

Step 1: Import the required Python libraries like pandas and sklearn

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

Step 2: Load and examine the dataset

names = ['Variance', 'Skewness', 'Curtosis', 'Entropy', 'Class']
dataset = pd.read_csv('bank_note_authentication.csv', names=names)
dataset.shape
dataset.head()

Step 3: Mention X and Y axis

X = dataset.iloc[:, 0:4].values  
y = dataset.iloc[:, 4].values 

Step 4: Split the dataset into training and testing dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20) 

Step 5: Scale the features

standardScaler = StandardScaler()  
X_train = standardScaler.fit_transform(X_train)  
X_test = standardScaler.transform(X_test) 

This step is not must for Random Forest as it is being taken care by Random Forest internally. Feature scaling is not required in tree based algorithms.

Step 6: Create and fit the model

model =  RandomForestClassifier(n_estimators=20, random_state=0) 
model.fit(X_train, y_train)  

"n_estimators" is the number of trees we want to create in a Random Forest. By default, it is 100.

Step 7: Predict from the model

y_pred = model.predict(X_test)

The y_pred is a numpy array that contains all the predicted values for the input values in the X_test.

Lets see the difference between the actual and predicted values.

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})  
df  

Step 8: Check the accuracy

confusionMatrix = confusion_matrix(y_test, y_pred)
accuracyScore = accuracy_score(y_test, y_pred)
classificationReport = classification_report(y_test, y_pred)
print(confusionMatrix)
print(accuracyScore * 100)
print(classificationReport)

No comments:

Post a Comment