The Professionals Point: Implement Random Forest Algorithm in Python using Scikit Learn Library for Classification Problem

Saturday, 23 February 2019

Random Forest is an ensemble learning algorithm based on bagging: it trains many decision trees on bootstrap samples of the training data and combines their predictions. It can be used for both classification and regression problems.
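To make the bagging idea concrete, here is a minimal hand-built sketch. It is not how scikit-learn implements Random Forest internally; the synthetic data from make_classification, the choice of 15 trees, and the variable names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (an assumption; not the bank note dataset).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

rng = np.random.default_rng(0)
trees = []
for seed in range(15):
    # Bootstrap sample: draw as many rows as the dataset has, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote: stack the per-tree predictions and pick the more common label.
votes = np.stack([t.predict(X) for t in trees])  # one row per tree
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("Training accuracy of the hand-built ensemble:", (ensemble_pred == y).mean())
```

Note that RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than by counting hard votes, so this sketch only approximates its behaviour.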

In this article, we will solve a classification problem (bank note authentication) using Random Forest. To do so, we import RandomForestClassifier from the sklearn.ensemble module.

You can download bank_note_authentication.csv from here. You can also download my Jupyter notebook containing the Random Forest code shown below.

Step 1: Import the required Python libraries like pandas and sklearn

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

Step 2: Load and examine the dataset

names = ['Variance', 'Skewness', 'Curtosis', 'Entropy', 'Class']
dataset = pd.read_csv('bank_note_authentication.csv', names=names)
dataset.shape
dataset.head()

Step 3: Separate the features (X) and the target (y)

X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
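The positional slices above take the first four columns as features and the fifth as the target. As a small sketch, the same selection can be written by column name, which keeps working if the column order changes. The two rows below are made up purely for illustration and are not taken from the real file.

```python
import pandas as pd

# Two made-up rows for illustration only; the real CSV has many more.
names = ['Variance', 'Skewness', 'Curtosis', 'Entropy', 'Class']
dataset = pd.DataFrame([[1.5, -2.0, 0.3, 0.1, 0],
                        [-0.7, 4.2, -1.1, 0.9, 1]], columns=names)

# Positional selection, as in the step above.
X_pos = dataset.iloc[:, 0:4].values
y_pos = dataset.iloc[:, 4].values

# Equivalent selection by column name.
X_named = dataset.drop(columns='Class').values
y_named = dataset['Class'].values

print(X_pos.shape, y_pos.shape)
print((X_pos == X_named).all(), (y_pos == y_named).all())
```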

Step 4: Split the dataset into training and testing dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
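The call above produces a different split on every run. If you want reproducible results, one option is to pass random_state, and stratify=y keeps the class proportions the same in both parts. The sketch below uses toy arrays (an assumption, not the bank note data), and the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption): 100 samples, 4 features, 30% in class 1.
X = np.arange(400, dtype=float).reshape(100, 4)
y = np.array([1] * 30 + [0] * 70)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# 80/20 split, with the share of class 1 preserved in both parts.
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```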

Step 5: Scale the features

standardScaler = StandardScaler()
X_train = standardScaler.fit_transform(X_train)
X_test = standardScaler.transform(X_test)

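This step is optional for Random Forest. Tree-based algorithms split on feature thresholds, so rescaling a feature changes the threshold values but not the resulting partitions. Random Forest does not scale the features internally; it simply does not need them scaled.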
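One way to check that scaling makes no practical difference for a tree-based model is to train the same forest on raw and on standardized features and compare the predictions. This is a sketch on synthetic data (an assumption, not the bank note dataset); with the same random_state the two models should agree on essentially every test sample, apart from rare floating-point edge cases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption; not the bank note dataset).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Forest trained on the raw features.
raw_model = RandomForestClassifier(n_estimators=20, random_state=0)
raw_pred = raw_model.fit(X_train, y_train).predict(X_test)

# Same settings, trained on standardized features.
scaler = StandardScaler().fit(X_train)
scaled_model = RandomForestClassifier(n_estimators=20, random_state=0)
scaled_model.fit(scaler.transform(X_train), y_train)
scaled_pred = scaled_model.predict(scaler.transform(X_test))

agreement = (raw_pred == scaled_pred).mean()
print("Fraction of identical predictions:", agreement)
```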

Step 6: Create and fit the model

model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(X_train, y_train)

"n_estimators" is the number of trees to build in the Random Forest. The default is 100 in scikit-learn 0.22 and later (it was 10 in earlier versions).
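As a quick check of what n_estimators controls, the fitted trees are stored in the model's estimators_ attribute, and get_params() reports the default used by whichever scikit-learn version is installed. The synthetic data here is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption; not the bank note dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# One fitted decision tree per estimator.
n_trees = len(model.estimators_)
print("Trees in the fitted forest:", n_trees)

# Default for the installed scikit-learn version.
default_n = RandomForestClassifier().get_params()["n_estimators"]
print("Default n_estimators:", default_n)
```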

Step 7: Predict from the model

y_pred = model.predict(X_test)

y_pred is a NumPy array containing the predicted class for each row of X_test.

Let's compare the actual and predicted values side by side.

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df

Step 8: Check the accuracy

confusionMatrix = confusion_matrix(y_test, y_pred)
accuracyScore = accuracy_score(y_test, y_pred)
classificationReport = classification_report(y_test, y_pred)
print(confusionMatrix)
print(accuracyScore * 100)
print(classificationReport)
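To read the confusion matrix, it helps to see how the other numbers follow from it. The sketch below uses made-up labels (not the bank note results) and derives accuracy, precision and recall for class 1 from the four cells of a binary confusion matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 1, 1, 0, 1, 1])

cm = confusion_matrix(y_true, y_hat)
# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # correct predictions / all predictions
precision = tp / (tp + fp)        # of predicted 1s, how many were right
recall = tp / (tp + fn)           # of actual 1s, how many were found

print(cm)
print(accuracy, precision, recall)
print(accuracy == accuracy_score(y_true, y_hat))
```

The diagonal of the matrix holds the correct predictions, so accuracy is the diagonal sum divided by the total count.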
