KNN (K-Nearest Neighbors) is one of the simplest classification algorithms in machine learning. It is easy to implement in Python with the scikit-learn library: you import KNeighborsClassifier from sklearn.neighbors and fit it to your data. Let's create a KNN model in Python using scikit-learn. I will use the popular and simple Iris dataset for this implementation.
You can also download my Jupyter notebook containing the KNN implementation code below.
Step 1: Import the required Python libraries like pandas and sklearn
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
Step 2: Load and examine the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']
dataset = pd.read_csv(url, names=names)
dataset.shape
dataset.head()
Step 3: Separate the features (X) and the target labels (y)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
Step 4: Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
Step 5: Use StandardScaler to scale the values in the dataset
standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train = standardScaler.transform(X_train)
X_test = standardScaler.transform(X_test)
Note: For detailed information about standardization, please see my post on that topic.
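To make Step 5 more concrete, here is a minimal sketch of what StandardScaler computes. The small arrays X_train_demo and X_test_demo are made-up illustrative values, not rows from the Iris data. The sketch assumes the default StandardScaler settings, where each column is shifted by the training mean and divided by the training standard deviation, and the test data reuses those training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only (hypothetical, not taken from the Iris data).
X_train_demo = np.array([[5.0, 1.0], [6.0, 2.0], [7.0, 3.0]])
X_test_demo = np.array([[6.0, 4.0]])

# Fit on the training data only, as in Step 5.
scaler = StandardScaler().fit(X_train_demo)
scaled_train = scaler.transform(X_train_demo)
scaled_test = scaler.transform(X_test_demo)

# Manual equivalent: z = (x - mean) / std, using training statistics.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)
manual_train = (X_train_demo - mean) / std
manual_test = (X_test_demo - mean) / std

print(np.allclose(scaled_train, manual_train))  # scaler matches the manual formula
print(np.allclose(scaled_test, manual_test))    # test set uses the training mean/std
```

Fitting the scaler on the training set alone keeps information about the test set out of the model.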
Step 6: Create and fit the model
model = KNeighborsClassifier(n_neighbors=11)
model.fit(X_train, y_train)
Note: Why 11 neighbors (K = 11)? This dataset has 150 rows (data points). I split it in a 4:1 ratio into training and test sets, so the training set contains 120 rows. A common rule of thumb is to set K close to the square root of the number of training points. The square root of 120 is about 10.95, so I took 11 as the value of K. You can also use cross-validation to find the optimum value of K.
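As a rough sketch of both approaches from the note, the code below first reproduces the square-root calculation and then tries a range of K values with 5-fold cross-validation on the training set. Several choices here are my assumptions rather than part of the original steps: the copy of Iris bundled with scikit-learn (load_iris) is used to avoid the download, the candidate values are the odd numbers from 1 to 29, and the scaler is wrapped in a pipeline so that it is refitted inside each fold.

```python
import math

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: use the Iris copy shipped with scikit-learn instead of the UCI URL.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Square-root rule of thumb: sqrt(120) is about 10.95, which rounds to 11.
sqrt_k = round(math.sqrt(len(X_train)))
print(sqrt_k)

# Assumption: odd candidate values of K from 1 to 29.
k_values = list(range(1, 30, 2))
mean_scores = {}
for k in k_values:
    # Scaling lives inside the pipeline so each fold is scaled on its own training part.
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores[k] = cross_val_score(pipeline, X_train, y_train, cv=5).mean()

# K with the highest mean cross-validated accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, round(mean_scores[best_k], 3))
```

The cross-validated K may differ from 11, and on a dataset this small several values often score almost the same.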
Step 7: Predict from the model
y_pred = model.predict(X_test)
y_pred is a NumPy array that contains the predicted class for each row in X_test.
Let's compare the actual and predicted values.
df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
df
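To illustrate what happens inside predict, here is a minimal from-scratch sketch of the KNN idea: measure the distance from a new point to every training point, take the K closest, and return the majority class. The helper knn_predict_one and the demo arrays are hypothetical names and values I made up for illustration. It assumes Euclidean distance and equal-weight votes, which match the KNeighborsClassifier defaults, but it is not scikit-learn's actual implementation and may break ties differently.

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_train, y_train, x, k):
    """Predict the class of a single point x by majority vote of its k nearest neighbors."""
    # Euclidean distance from x to every training point.
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the class labels among those neighbors and return the most common one.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Illustrative data only: three points of class 'a' near the origin, two of class 'b' near (3, 3).
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y_demo = np.array(["a", "a", "a", "b", "b"])

pred_near_origin = knn_predict_one(X_demo, y_demo, np.array([0.05, 0.05]), k=3)
pred_near_three = knn_predict_one(X_demo, y_demo, np.array([2.9, 3.0]), k=3)
print(pred_near_origin)  # a
print(pred_near_three)   # b
```

In the second call only two 'b' points exist, so the third neighbor is an 'a' point, and 'b' still wins the vote 2 to 1.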
Step 8: Check the accuracy
confusionMatrix = confusion_matrix(y_test, y_pred)
accuracyScore = accuracy_score(y_test, y_pred)
classificationReport = classification_report(y_test, y_pred)
print(confusionMatrix)
print(accuracyScore * 100)
print(classificationReport)
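To help read the output of Step 8, here is a small sketch of how the confusion matrix and the accuracy score relate. The six labels in y_true_demo and y_pred_demo are made up for illustration and are not results from the model above. Rows of the matrix are actual classes and columns are predicted classes, so correct predictions sit on the diagonal, and accuracy is the diagonal sum divided by the total count.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels only (hypothetical, not produced by the model above).
y_true_demo = ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
               "Iris-versicolor", "Iris-virginica", "Iris-virginica"]
y_pred_demo = ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
               "Iris-virginica", "Iris-virginica", "Iris-virginica"]

# Rows are actual classes, columns are predicted classes (sorted label order).
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Accuracy = correct predictions (diagonal) / all predictions.
manual_accuracy = np.trace(cm) / cm.sum()
library_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(manual_accuracy, library_accuracy)
```

Here one Iris-versicolor sample is predicted as Iris-virginica, so five of the six predictions are correct.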