
Monday, 18 February 2019

Implement KNN Algorithm in Python using Scikit Learn Library

KNN (k-nearest neighbors) is a simple classification algorithm in machine learning, and it is easy to implement in Python with the scikit-learn library. To build a KNN model, import KNeighborsClassifier from sklearn.neighbors. Let's create one in Python, using the popular and simple Iris dataset.

You can also download my Jupyter notebook, which contains the KNN implementation shown below.

Step 1: Import the required Python libraries like pandas and sklearn

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

Step 2: Load and examine the dataset

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']
dataset = pd.read_csv(url, names=names) 

dataset.shape
dataset.head()

Step 3: Separate the features (X) and the target labels (y)

X = dataset.iloc[:, :-1].values  
y = dataset.iloc[:, 4].values 

Step 4: Split the dataset into training and testing dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
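
As a quick sanity check on what this call does, here is a small sketch on stand-in data with the same shape as the Iris dataset (150 rows, 4 columns). The arrays X_demo and y_demo are made up for illustration and are not part of the original notebook. It shows that test_size=0.20 yields a 120/30 split, and that a fixed random_state makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data shaped like Iris: 150 rows, 4 feature columns, 3 classes.
# These values are hypothetical placeholders, not the real measurements.
X_demo = np.arange(150 * 4).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

a_train, a_test, b_train, b_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Same random_state, so the shuffle (and therefore the split) is identical.
again_train, again_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, again_test))
```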

Step 5: Use StandardScaler to scale the values in the dataset

standardScaler = StandardScaler()  
standardScaler.fit(X_train)
X_train = standardScaler.transform(X_train)  
X_test = standardScaler.transform(X_test)  

Note: For detailed information about standardization, please see my post on that topic.
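
To make the scaling step concrete, the sketch below reproduces by hand what StandardScaler does: it subtracts each column's mean and divides by each column's standard deviation, both learned from the training data only, and then applies those same numbers to the test data. The toy arrays are made-up values for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test data (hypothetical values, for illustration only).
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[2.0, 40.0]])

scaler = StandardScaler().fit(X_train_demo)
scaled_train = scaler.transform(X_train_demo)
scaled_test = scaler.transform(X_test_demo)

# Manual version: per-column mean and population standard deviation
# computed from the training rows only.
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)
manual_train = (X_train_demo - train_mean) / train_std
manual_test = (X_test_demo - train_mean) / train_std

print(np.allclose(scaled_train, manual_train))
print(np.allclose(scaled_test, manual_test))
```

Fitting on the training set alone keeps information about the test rows out of the model, which is why Step 5 calls fit only on X_train.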

Step 6: Create and fit the model

model = KNeighborsClassifier(n_neighbors=11)  
model.fit(X_train, y_train) 

Note: Why 11 neighbors (K = 11)? The dataset has 150 rows (data points), and the 4:1 split leaves 120 rows for training. A common rule of thumb sets K to the square root of the number of training points. The square root of 120 is about 10.95, so I rounded it to 11. You can also use cross-validation to find the optimum value of K.
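
Here is one possible way to compare the square-root rule of thumb with a cross-validated choice of K. It is a sketch under a few assumptions that are not in the original notebook: it loads the Iris data from sklearn.datasets.load_iris instead of the UCI URL so that it runs offline, it tries odd values of K from 1 to 25, and it uses 5-fold cross-validation. The scaler sits inside a pipeline so that it is refit on each training fold.

```python
import math

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: use scikit-learn's bundled copy of Iris rather than the UCI URL.
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.20, random_state=0
)

# Rule of thumb from the note: K is roughly sqrt(number of training rows).
k_rule_of_thumb = round(math.sqrt(len(X_tr)))

# Mean 5-fold cross-validation accuracy for each candidate K (odd values only).
cv_scores = {}
for k in range(1, 26, 2):
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(pipeline, X_tr, y_tr, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("Rule of thumb K:", k_rule_of_thumb)
print("Best K by cross-validation:", best_k)
print("Mean CV accuracy at best K:", cv_scores[best_k])
```

The two values of K will not necessarily agree. The rule of thumb is only a starting point, and on a dataset this small several values of K often score about the same.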

Step 7: Predict from the model

y_pred = model.predict(X_test)  

y_pred is a NumPy array containing the predicted class for each row of X_test.
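
For intuition about how each prediction is produced, the sketch below implements the basic KNN idea from scratch: measure the distance from the query point to every training point, take the K closest, and return the most common label among them. The function knn_predict_one and the toy points are hypothetical and for illustration only. This is not scikit-learn's actual implementation, which uses faster neighbor searches and may break ties differently.

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_train, y_train, query, k):
    """Predict the label of one query point by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: three points near the origin labelled "a", two far away labelled "b".
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array(["a", "a", "a", "b", "b"])

print(knn_predict_one(points, labels, np.array([0.15, 0.15]), k=3))
```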

Let's compare the actual and predicted values.

df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})  
df 

Step 8: Check the accuracy

confusionMatrix = confusion_matrix(y_test, y_pred)
accuracyScore = accuracy_score(y_test, y_pred)
classificationReport = classification_report(y_test, y_pred)
print(confusionMatrix)
print(accuracyScore * 100)
print(classificationReport)
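
To show how the confusion matrix and the accuracy score relate, here is a small sketch on made-up labels (not the notebook's real predictions). The rows of the matrix are the actual classes and the columns are the predicted classes, so the diagonal counts the correct predictions. Accuracy is the sum of the diagonal divided by the total number of predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration: one versicolor is mistaken for virginica.
actual = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

cm = confusion_matrix(actual, predicted)

# Correct predictions lie on the diagonal of the confusion matrix.
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm)
print(accuracy_from_cm, accuracy_score(actual, predicted))
```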
