Tuesday, 5 March 2019

Implement PCA in Python using the Scikit-Learn Library

We calculate Principal Components on a dataset using the PCA() class in the Scikit-Learn library. 

When creating the PCA object, we can pass one of the following to the constructor through the n_components parameter:

1. The number of Principal Components we want to keep (an integer) OR
2. The fraction of variance we want to retain (a float between 0 and 1)

If we don't pass any parameter in the constructor, all the Principal Components are kept. Their number is min(n_samples, n_features), which equals the number of features whenever the dataset has more rows than columns. We should refrain from this approach when building a model, because no dimensionality reduction takes place and that defeats the purpose of PCA.
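As a rough sketch of the three ways of calling the constructor, the snippet below fits PCA on a small random array. X_demo and the 0.90 threshold are made up for illustration and have nothing to do with the bank note dataset used in this article.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 100 rows, 4 features (not the bank note dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))

pca_all = PCA().fit(X_demo)                    # no argument: keep every component
pca_two = PCA(n_components=2).fit(X_demo)      # integer: keep exactly 2 components
pca_var = PCA(n_components=0.90).fit(X_demo)   # float in (0, 1): keep enough for 90% of the variance

# n_components_ reports how many components each fitted object actually kept.
print(pca_all.n_components_)
print(pca_two.n_components_)
print(pca_var.n_components_)
```

After fitting, n_components_ is the attribute to check when the constructor was given a variance fraction instead of a count.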

We will see it in detail in this article. We will use the bank note authentication dataset and train a Random Forest classifier to identify the authenticity of the currency. For theory on PCA, you can go through this article.

Step 1: Import the required Python libraries (pandas and sklearn)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

Step 2: Load and examine the dataset

names = ['Variance', 'Skewness', 'Curtosis', 'Entropy', 'Class']
dataset = pd.read_csv("C:\\datasets\\bank_note_authentication.csv", names=names)
print(dataset.shape)   # (number of rows, number of columns)
print(dataset.head())  # first five rows

Step 3: Separate the features (X) and the labels (y)

X = dataset.drop('Class', axis=1)  
y = dataset['Class']    

X contains the attributes (features)
y contains the labels

Step 4: Split the dataset into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=0) 

Step 5: Feature scaling

standardScaler = StandardScaler()  
X_train = standardScaler.fit_transform(X_train)  
X_test = standardScaler.transform(X_test)
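Scaling matters for PCA because the components follow the directions of largest variance, so a feature measured on a large scale can dominate. A small sketch on made-up data (two independent features with very different scales, not the bank note dataset) shows the effect:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent hypothetical features on very different scales.
X_demo = np.column_stack([
    rng.normal(scale=1000.0, size=500),
    rng.normal(scale=1.0, size=500),
])

# Without scaling, the large-scale feature takes almost all of the variance.
ratio_raw = PCA().fit(X_demo).explained_variance_ratio_

# After standardization, both features contribute on an equal footing.
X_demo_scaled = StandardScaler().fit_transform(X_demo)
ratio_scaled = PCA().fit(X_demo_scaled).explained_variance_ratio_

print(ratio_raw)
print(ratio_scaled)
```

Without scaling, the first component is essentially the large-scale feature. After scaling, the variance is shared roughly evenly between the two components.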

Step 6.1: Apply PCA (Method 1)

pca = PCA()  
X_train = pca.fit_transform(X_train)  
X_test = pca.transform(X_test)  

explained_variance = pca.explained_variance_ratio_
print(explained_variance)

Output:
[0.54578721 0.31931922 0.09136961 0.04352396]

Note: Here I have not passed any parameter to the PCA class constructor. So, it will return 4 Principal Components (equal to the number of features in our dataset). 

explained_variance_ratio_ returns the fraction of the total variance explained by each Principal Component.

From the output, you can see that the first PC accounts for about 55% of the variance and the second PC accounts for around 32%. Together, the first two Principal Components account for around 86% of the variance in the dataset. So, instead of using all four features, we can use only two or three Principal Components to build our model.
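The running total can be computed directly from the ratios printed in Step 6.1. This is only a sketch: the array is copied by hand from that output, and target is a variable name chosen for this example.

```python
import numpy as np

# Explained variance ratios copied from the output of Step 6.1.
explained_variance = np.array([0.54578721, 0.31931922, 0.09136961, 0.04352396])

# Running total of variance after keeping 1, 2, 3 and 4 components.
cumulative = np.cumsum(explained_variance)
print(cumulative)

# Smallest number of components whose running total reaches the target.
target = 0.86
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(n_needed)  # → 2
```

The second entry of the running total is about 0.865, which is where the 86% figure comes from.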

This example does not highlight the great importance of PCA as we have only 4 features in our dataset. But in the real world we can easily have 40 features or 40K features or more. In these scenarios, PCA does a fantastic job. Instead of using 40K features, we may need only a few hundred Principal Components, which can drastically reduce training time and storage, often with little loss of accuracy.
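To get a feel for the high-dimensional case, here is a sketch on synthetic data. The 200 features are deliberately built from only 5 hidden factors plus a little noise, an assumption made for this illustration, so most of the columns are redundant.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 500 rows, 200 features driven by 5 hidden factors.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 200))
X_wide = latent @ mixing + 0.05 * rng.normal(size=(500, 200))

X_wide = StandardScaler().fit_transform(X_wide)

# Keep as many components as needed to retain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_wide)

print(X_wide.shape, "->", X_reduced.shape)
print(pca.n_components_)
```

Because the columns are so strongly correlated, a handful of components is enough here. Real datasets are rarely this clean, so the reduction is usually less dramatic.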

Step 6.2: Apply PCA (Method 2)

pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)  
X_test = pca.transform(X_test)
PC_DataFrame = pd.DataFrame(data = X_train, columns = ['PC1', 'PC2'])
print(PC_DataFrame)

Note: From step 6.1, it is clear that if we use only two Principal Components, we can still cover about 86% of the variance. So, I passed the number of components as 2 in the constructor.

Step 6.3: Apply PCA (Method 3)

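pca = PCA(n_components=0.86)  # keep enough components to explain at least 86% of the variance
X_train = pca.fit_transform(X_train)  
X_test = pca.transform(X_test)
print(pca.n_components_)  # number of Principal Components actually kept
print(pca.components_)    # principal axes, one row per kept component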

Note: This is another way of doing PCA on the dataset. If I want to retain 86% of the variance in my dataset and don't want to bother about the number of Principal Components, I can use this approach.

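components_ holds the principal axes, one row per retained Principal Component. Its number of rows (also available as n_components_) is therefore the number of Principal Components needed to reach 86% of the variance. In this case, the number of Principal Components is 2.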
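A quick way to inspect these attributes is sketched here on a random array (X_demo and pca_demo are made-up names for this example, not the bank note data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))  # hypothetical data with 4 features

pca_demo = PCA(n_components=2).fit(X_demo)

# One row per kept component, one column per original feature.
print(pca_demo.components_.shape)  # (2, 4)
print(pca_demo.n_components_)      # 2

# Each row is a unit-length direction in the original feature space.
row_norms = np.linalg.norm(pca_demo.components_, axis=1)
print(row_norms)
```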

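So, we can use either the 6.2 or the 6.3 approach to implement PCA. Each of the three snippets overwrites X_train and X_test, so apply only one of them to the scaled data from Step 5 instead of running them back to back.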

The number of Principal Components to retain in a feature set depends on several conditions such as storage capacity, training time, performance etc. In some datasets all the features contribute equally to the overall variance, so all the Principal Components are crucial to the predictions and none can be ignored. A general rule of thumb is to keep the Principal Components that contribute significant variance and ignore those with diminishing variance returns.
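One way to check the rule of thumb against model quality is to try a few component counts and compare cross-validated accuracy. The sketch below uses synthetic data from make_classification and an arbitrary list of candidate counts, both assumptions made for illustration. Wrapping the steps in a pipeline means the scaler and PCA are refit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for a real feature set.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

scores = {}
for k in (2, 4, 6, 10):
    pipeline = make_pipeline(
        StandardScaler(),
        PCA(n_components=k),
        RandomForestClassifier(n_estimators=50, random_state=0),
    )
    # Mean accuracy over 5 cross-validation folds for this component count.
    scores[k] = cross_val_score(pipeline, X_demo, y_demo, cv=5).mean()

for k, score in scores.items():
    print(k, round(score, 3))
```

If accuracy stops improving after a certain count, the extra components add training cost without helping the predictions.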

Step 7: Create and fit the model

model = RandomForestClassifier(max_depth=2, random_state=0)  
model.fit(X_train, y_train)

Step 8: Predict from the model

y_pred = model.predict(X_test) 

y_pred is a numpy array that contains the predicted labels for the rows in X_test.

Let's see the difference between the actual and predicted values.

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(df)

Step 9: Check the accuracy

confusionMatrix = confusion_matrix(y_test, y_pred)
accuracyScore = accuracy_score(y_test, y_pred)
classificationReport = classification_report(y_test, y_pred)
print(confusionMatrix)
print(accuracyScore * 100)
print(classificationReport)
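To read the confusion matrix: rows are the actual classes, columns are the predicted classes, and the diagonal holds the correct predictions. The tiny hand-made label lists in this sketch (not taken from the bank note results) show that accuracy is the diagonal sum divided by the total count.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, only to illustrate how the numbers relate.
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_hat)
print(cm)  # rows: actual class, columns: predicted class

# Correct predictions sit on the diagonal of the matrix.
accuracy_from_cm = np.trace(cm) / cm.sum()
print(accuracy_from_cm, accuracy_score(y_true, y_hat))
```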
