We calculate Principal Components on a dataset using the
**PCA()** class in the Scikit-Learn library.

While creating the PCA() object, we can pass one of the following parameters in the constructor:

1. **Number of Principal Components** we need to consider, OR

2. **Amount of Variance** we need to retain

If we don't pass any parameter in the constructor, all the Principal Components (usually equal to the number of features in the dataset) are kept. We should refrain from this approach, as it does not serve the purpose of PCA, which is to reduce the number of dimensions.

We will see it in detail in this article. We will use the bank note authentication dataset and implement a Random Forest to identify the authenticity of the currency. For theory on PCA, you can go through this article.

**You can download the dataset from here and my Jupyter notebook implementing PCA from here.**

**Step 1: Import the required Python libraries like pandas and sklearn**

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

**Step 2: Load and examine the dataset**

names = ['Variance', 'Skewness', 'Curtosis', 'Entropy', 'Class']

dataset = pd.read_csv('bank_note_authentication.csv', names=names)

dataset.shape

dataset.head()

**Step 3: Separate the features (X) and the labels (y)**

X = dataset.drop('Class', axis=1)

y = dataset['Class']

*X contains the attributes (features).*

*y contains the labels.*

**Step 4: Split the dataset into training and testing datasets**

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=0)

**Step 5: Feature scaling**

standardScaler = StandardScaler()

X_train = standardScaler.fit_transform(X_train)

X_test = standardScaler.transform(X_test)
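To see why scaling comes before PCA, here is a minimal sketch on synthetic data (not the bank note dataset; the two features and their scales are made up for illustration). Without scaling, the feature with the larger numeric range dominates the first Principal Component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): two independent features
# on very different scales.
rng = np.random.default_rng(0)
small = rng.normal(loc=0.0, scale=1.0, size=500)
large = rng.normal(loc=0.0, scale=1000.0, size=500)
X_demo = np.column_stack([small, large])

# PCA on the raw data: almost all variance is attributed to the large-scale feature.
raw_ratio = PCA().fit(X_demo).explained_variance_ratio_

# PCA after standardization: both features contribute comparably.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

print("Unscaled:", raw_ratio)
print("Scaled:  ", scaled_ratio)
```

On the unscaled data the first ratio should be very close to 1, while after scaling the two components should share the variance roughly equally.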

**Step 6.1: Apply PCA (Method 1)**

pca = PCA()

X_train = pca.fit_transform(X_train)

X_test = pca.transform(X_test)

explained_variance = pca.explained_variance_ratio_

print(explained_variance)

**Output:**

[0.54578721 0.31931922 0.09136961 0.04352396]

**Note:** Here I have not passed any parameter to the PCA class constructor, so it returns 4 Principal Components (equal to the number of features in our dataset). **explained_variance_ratio_** returns the fraction of the total variance explained by each Principal Component.

From the output, you can see that the first PC accounts for about 55% of the variance and the second PC accounts for around 32%. Together, the first two Principal Components account for around 86% of the variance in the dataset. So, instead of using all four features, we can use only two or three Principal Components to build our model.
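The running totals behind these percentages can be checked with a cumulative sum. This is a small sketch that simply reuses the four ratios printed in the output above; the variable name `cumulative_variance` is my own.

```python
import numpy as np

# Explained variance ratios copied from the output above.
explained_variance = np.array([0.54578721, 0.31931922, 0.09136961, 0.04352396])

# Running total: entry k-1 is the variance retained by the first k components.
cumulative_variance = np.cumsum(explained_variance)

for k, total in enumerate(cumulative_variance, start=1):
    print(f"First {k} PC(s): {total:.2%} of the variance")
```

The second entry is where the "around 86%" figure for two Principal Components comes from.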

This example does not highlight the full importance of PCA, as we have only 4 features in our dataset. But in the real world, we can easily have 40 features, 40K features or more. In these scenarios, PCA does a fantastic job: instead of using 40K features, we may need only a few hundred Principal Components, which can drastically reduce the training time and storage requirements of our model.
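To get a feel for the wide-dataset case, here is a hedged sketch on synthetic data: 200 features that are all noisy mixtures of 5 hidden factors. The sizes, the noise level and the 95% target are arbitrary choices for illustration, and real datasets will rarely compress this neatly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic wide dataset (an assumption for illustration): every feature is a
# random mixture of a few hidden factors plus a little noise.
rng = np.random.default_rng(0)
n_samples, n_latent, n_features = 1000, 5, 200

latent = rng.normal(size=(n_samples, n_latent))
mixing = rng.normal(size=(n_latent, n_features))
noise = 0.1 * rng.normal(size=(n_samples, n_features))
X_wide = latent @ mixing + noise

# Scale, then keep as many components as needed for 95% of the variance.
X_wide_scaled = StandardScaler().fit_transform(X_wide)
pca_wide = PCA(n_components=0.95)
X_reduced = pca_wide.fit_transform(X_wide_scaled)

print(X_wide.shape, "->", X_reduced.shape)
```

Because the 200 columns are driven by only a few underlying factors, PCA should keep just a handful of components here.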

**Step 6.2: Apply PCA (Method 2)**

pca = PCA(n_components=2)

X_train = pca.fit_transform(X_train)

X_test = pca.transform(X_test)

PC_DataFrame = pd.DataFrame(data = X_train, columns = ['PC1', 'PC2'])

print(PC_DataFrame)

**Note:** From step 6.1, it is clear that if we use only two Principal Components, we can still cover 86% of the variance. So, I passed the number of components as 2 in the constructor.

**Step 6.3: Apply PCA (Method 3)**

pca = PCA(0.86)

X_train = pca.fit_transform(X_train)

X_test = pca.transform(X_test)

print(pca.components_)

**Note:** This is another way of doing PCA on the dataset. If I want to retain 86% of the variance in my dataset and don't want to bother about the number of Principal Components, I can use this approach. **components_** returns the Principal Components (one row per component) that were kept to reach 86% of the variance, so the number of rows is the number of components; the count itself is also available as **n_components_**. In this case, the number of Principal Components is 2.

So, we can use either the 6.2 or the 6.3 approach to implement PCA. Since each method overwrites X_train and X_test, apply the chosen method to the scaled data from Step 5 rather than running all three in sequence.
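Here is a small sketch, on synthetic data, of what the fitted PCA object exposes when a variance fraction is passed: `n_components_` holds the number of components that were kept, while `components_` holds the component vectors themselves, one row per component. The data and the per-column scales are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 4-feature data (an assumption for illustration); the columns are
# given different spreads so that a few components carry most of the variance.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

# A float between 0 and 1 asks PCA to keep enough components
# to explain at least that fraction of the variance.
pca_demo = PCA(n_components=0.86).fit(X_demo)

print("Components kept:", pca_demo.n_components_)
print("Shape of components_:", pca_demo.components_.shape)
print("Variance retained:", pca_demo.explained_variance_ratio_.sum())
```

The first dimension of `components_.shape` always matches `n_components_`, and the second matches the number of original features.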

The number of Principal Components to retain depends on several conditions such as storage capacity, training time, performance etc. In some datasets, where all the features contribute equally to the overall variance, all the Principal Components are crucial to the predictions and none can be ignored. A general rule of thumb is to keep the Principal Components that contribute significant variance and ignore those with diminishing variance returns.
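One way to turn this rule of thumb into code is to keep the leading components until the individual contribution drops below a cutoff. This is only a sketch: the helper `choose_n_components`, its `min_gain` parameter and the 10% default are hypothetical choices of mine, not a Scikit-Learn API or a recommended value.

```python
import numpy as np

def choose_n_components(explained_variance_ratio, min_gain=0.10):
    """Count the leading components that each add at least `min_gain` of the variance.

    Hypothetical helper for illustration; the cutoff is an arbitrary choice.
    """
    ratios = np.asarray(explained_variance_ratio)
    # Ratios from PCA are sorted in decreasing order, so stop at the first
    # component whose individual contribution drops below the cutoff.
    below = np.nonzero(ratios < min_gain)[0]
    return int(below[0]) if below.size else len(ratios)

# Explained variance ratios from step 6.1.
ratios = [0.54578721, 0.31931922, 0.09136961, 0.04352396]
print(choose_n_components(ratios))                 # cutoff of 10%
print(choose_n_components(ratios, min_gain=0.05))  # cutoff of 5%
```

With the ratios from step 6.1, a 10% cutoff keeps two components and a 5% cutoff keeps three, which matches the "two or three" suggested earlier.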

**Step 7: Create and fit the model**

model = RandomForestClassifier(max_depth=2, random_state=0)

model.fit(X_train, y_train)
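Steps 5 to 7 can also be chained with Scikit-Learn's `Pipeline`, so that the scaler and PCA are fitted on the training data only and no variable gets overwritten along the way. This is a minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the CSV file; the step names "scaler", "pca" and "forest" are my own labels.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bank note data (an assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Scaling, PCA and the classifier run in order; fit() only sees training data.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("forest", RandomForestClassifier(max_depth=2, random_state=0)),
])
pipeline.fit(X_tr, y_tr)

# score() applies the fitted scaler and PCA to the test data before predicting.
print(pipeline.score(X_te, y_te))
```

The accuracy printed here says nothing about the bank note dataset; it only shows that the chained steps run end to end.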

**Step 8: Predict from the model**

y_pred = model.predict(X_test)

*y_pred is a numpy array that contains the predicted values for the inputs in X_test.*

*Let's see the difference between the actual and predicted values.*

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

df

**Step 9: Check the accuracy**

confusionMatrix = confusion_matrix(y_test, y_pred)

accuracyScore = accuracy_score(y_test, y_pred)

classificationReport = classification_report(y_test, y_pred)

print(confusionMatrix)

print(accuracyScore * 100)

print(classificationReport)
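To connect the confusion matrix with the accuracy score, here is a short sketch with made-up labels (these are not results from the bank note model). The accuracy is the sum of the diagonal of the confusion matrix divided by the total number of samples.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only; these are not results from the article.
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 1, 1, 0, 1, 0])

cm = confusion_matrix(y_true_demo, y_pred_demo)

# Rows are actual classes, columns are predicted classes, so the diagonal
# holds the correctly classified samples.
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm)
print(accuracy_from_cm, accuracy_score(y_true_demo, y_pred_demo))
```

Both numbers on the last line agree, since `accuracy_score` computes the same fraction of correct predictions.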
