
Sunday, 24 February 2019

Implement Simple Linear Regression Algorithm in Python using Scikit Learn Library

A Linear Regression model finds the values of the intercept and the slope that give the line that best fits the data. Linear Regression can be classified as Simple Linear Regression or Multiple Linear Regression.
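To make "best fit" a little more concrete, here is a minimal sketch of how the slope and intercept can be computed by hand with ordinary least squares. The toy data below is made up for illustration (it is not student_scores.csv), and the variable names are my own.

```python
import numpy as np

# Hypothetical toy data (not the student_scores.csv dataset): points on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Ordinary least squares with a single attribute:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   intercept = mean(y) - slope * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print('Slope:', slope)
print('Intercept:', intercept)
```

Because the toy points lie exactly on a line, the computed slope and intercept recover that line. With real, noisy data the same formulas give the line with the smallest sum of squared errors.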

Simple Linear Regression involves only two variables: an attribute and a label. Attributes are the independent variables, while labels are the dependent variables whose values are to be predicted.

In this article, we will implement Simple Linear Regression. Let's predict the percentage of marks that a student is expected to score based on the number of hours studied. This is a simple linear regression task as it involves just two variables. "Hours Studied" is the independent variable (attribute) and "Percentage Scored" is the dependent variable (label), which we are going to predict using Simple Linear Regression.

You can download student_scores.csv from here. You can also download my Jupyter notebook containing the Simple Linear Regression code below.

Step 1: Import the required Python libraries like pandas, numpy and sklearn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  
# Jupyter notebook magic: shows plots inline (remove this line when running as a plain .py script)
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error 

Step 2: Load and examine the dataset

dataset = pd.read_csv('student_scores.csv') 
dataset.shape
dataset.head()
dataset.describe() 

Please note that "describe()" displays summary statistics of the data, such as the mean and standard deviation.

I got a mean score of 51.480000. I will use this value to evaluate the performance of the model in step 9.
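If you prefer to read the mean programmatically instead of copying it from the table, something like the following should work. This is only a sketch: the small DataFrame is a hypothetical stand-in for student_scores.csv with the same two column names.

```python
import pandas as pd

# Hypothetical stand-in for student_scores.csv (same column names, made-up rows)
dataset = pd.DataFrame({'Hours': [2.5, 5.1, 3.2, 8.5],
                        'Scores': [21, 47, 27, 75]})

# describe() returns a DataFrame indexed by statistic name ('mean', 'std', ...)
mean_from_describe = dataset.describe().loc['mean', 'Scores']

# The same value, computed directly on the column
mean_direct = dataset['Scores'].mean()

print(mean_from_describe, mean_direct)
```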

Step 3: Draw a plot between attribute and label

dataset.plot(x='Hours', y='Scores', style='o')  
plt.title('Hours vs Percentage')  
plt.xlabel('Hours Studied')  
plt.ylabel('Percentage Score')  
plt.show() 

The plot will show a positive linear relationship between the number of hours studied and the percentage scored.

Step 4: Separate the attribute (X) and the label (y)

X = dataset.iloc[:, :-1].values  
y = dataset.iloc[:, 1].values  

X contains the attributes (every column except the last one, i.e. "Hours")
y contains the labels (the second column, i.e. "Scores")
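One detail worth noticing is that X comes out as a 2D array while y is 1D, which is the layout scikit-learn estimators expect. A quick check, using a hypothetical four-row stand-in for the real CSV:

```python
import pandas as pd

# Hypothetical stand-in for student_scores.csv (same column names, made-up rows)
dataset = pd.DataFrame({'Hours': [2.5, 5.1, 3.2, 8.5],
                        'Scores': [21, 47, 27, 75]})

X = dataset.iloc[:, :-1].values  # all columns except the last -> 2D array
y = dataset.iloc[:, 1].values    # second column only -> 1D array

print(X.shape)  # (4, 1)
print(y.shape)  # (4,)
```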

Step 5: Split the dataset into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=0) 
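With test_size = 0.20, 20% of the rows are held out for testing and the remaining 80% are used for training. random_state=0 makes the shuffle reproducible. The sketch below shows the resulting sizes on ten made-up rows (not the real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, one attribute column
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

print(len(X_train), len(X_test))  # 8 2
```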

Step 6: Create and fit the model

model = LinearRegression()
model.fit(X_train, y_train)  

Step 7: Print intercept and coefficient (slope) of the best fit line

print(model.intercept_) 
print(model.coef_)  

I get an intercept of 2.018160041434683 and a coefficient of 9.91065648. This means that for every additional hour studied, the predicted score rises by about 9.91 percentage points. In simpler words, a student who studies one hour more for an exam can expect a score about 9.91 percentage points higher.
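The fitted line is simply score = intercept + slope * hours, so the prediction can be reproduced by hand. The sketch below plugs in the two values printed above. The helper name predicted_score is my own and is not part of scikit-learn.

```python
# Values printed by model.intercept_ and model.coef_ in this article
intercept = 2.018160041434683
slope = 9.91065648

def predicted_score(hours):
    """Predicted percentage for a given number of study hours."""
    return intercept + slope * hours

print(predicted_score(5.0))
# One extra hour changes the prediction by exactly the slope
print(predicted_score(6.0) - predicted_score(5.0))
```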

Step 8: Predict from the model

y_pred = model.predict(X_test)

y_pred is a NumPy array that contains the predicted values for the inputs in X_test.

Let's see the difference between the actual and predicted values.
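The same predict() call also works for a single new value, as long as it is passed as a 2D array (one row, one column). A minimal sketch, fitted on made-up points that lie exactly on y = 10x + 2 rather than on the real dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data lying exactly on the line y = 10x + 2
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([12.0, 22.0, 32.0, 42.0])

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# predict() expects shape (n_samples, n_features), hence the double brackets
new_hours = np.array([[9.25]])
new_prediction = demo_model.predict(new_hours)

print(new_prediction)
```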

df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})  
df 

Step 9: Evaluate the model

meanAbsoluteError = mean_absolute_error(y_test, y_pred)
meanSquaredError = mean_squared_error(y_test, y_pred)
rootMeanSquaredError = np.sqrt(meanSquaredError)
print('Mean Absolute Error:', meanAbsoluteError)  
print('Mean Squared Error:', meanSquaredError)  
print('Root Mean Squared Error:', rootMeanSquaredError)
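To see what these three metrics actually compute, here is a small sketch that works them out by hand with NumPy and compares the results with scikit-learn. The three actual and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted scores
y_true = np.array([20.0, 30.0, 60.0])
y_hat = np.array([22.0, 27.0, 60.0])

errors = y_true - y_hat          # [-2, 3, 0]

mae = np.mean(np.abs(errors))    # average absolute error
mse = np.mean(errors ** 2)       # average squared error
rmse = np.sqrt(mse)              # square root brings the unit back to "percentage"

print(mae, mean_absolute_error(y_true, y_hat))
print(mse, mean_squared_error(y_true, y_hat))
print(rmse)
```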

I got a Root Mean Squared Error of 4.6474476121003665, which is less than 10% of the mean score of all the students, i.e. 51.48 (see step 2). This suggests that the model did a decent job.
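The "less than 10% of the mean" comparison is just a division, shown here with the two numbers reported in this article. Treat the 10% threshold as a rough rule of thumb rather than a formal test.

```python
# Values reported in this article
rmse = 4.6474476121003665   # from step 9
mean_score = 51.48          # from describe() in step 2

relative_error = rmse / mean_score

print(f"RMSE is {relative_error:.1%} of the mean score")
print("Below 10% of the mean:", relative_error < 0.10)
```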
