A linear regression model finds the best values for the **intercept** and the **slope**, which together define the line that best fits the data. Linear regression can be classified as **Simple Linear Regression** and **Multiple Linear Regression**.

Simple Linear Regression involves only **two variables**. One is the **attribute** and the other is the **label**. Attributes are the independent variables, while labels are the dependent variables whose values are to be predicted.

In this article, we will implement Simple Linear Regression. Let's predict the percentage of marks that a student is expected to score based on the number of hours he studied. This is a simple linear regression task as it involves just two variables. "Hours Studied" is the independent variable (attribute) and "Percentage Scored" is the dependent variable (label), which we are going to predict using Simple Linear Regression.
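To make the idea of a best-fit intercept and slope concrete, here is a minimal sketch of the closed-form least-squares calculation for one attribute. The data is a hypothetical toy example (not taken from student_scores.csv), and the variable names are chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data: scores follow 2 + 10 * hours exactly
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = 2.0 + 10.0 * hours

# Closed-form ordinary least squares for a single attribute
x_mean, y_mean = hours.mean(), scores.mean()
slope = np.sum((hours - x_mean) * (scores - y_mean)) / np.sum((hours - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print(slope, intercept)
```

Because the toy data lies exactly on a line, the calculation recovers that line's slope and intercept. Real data is noisy, so the fitted line only approximates the points.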

You can download **student_scores.csv** from here. You can also download my Jupyter notebook containing the Simple Linear Regression code below.

**Step 1: Import the required Python libraries like pandas, numpy and sklearn**

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_absolute_error, mean_squared_error

**Step 2: Load and examine the dataset**

dataset = pd.read_csv('student_scores.csv')

dataset.shape

dataset.head()

dataset.describe()

*Please note that "describe()" displays summary statistics of the data, such as the mean and standard deviation.*

*I got a mean score of **51.48**. I will use this value to evaluate the performance of the algorithm in Step 9.*

**Step 3: Draw a plot between attribute and label**

dataset.plot(x='Hours', y='Scores', style='o')

plt.title('Hours vs Percentage')

plt.xlabel('Hours Studied')

plt.ylabel('Percentage Score')

plt.show()

*The plot will show a positive linear relation between the number of hours studied and percentage of score.*

**Step 4: Define the attributes (X) and labels (y)**

X = dataset.iloc[:, :-1].values

y = dataset.iloc[:, 1].values

*X contains the attributes (Hours).*

*y contains the labels (Scores).*

**Step 5: Split the dataset into training and test datasets**

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
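As a rough illustration of what the split does, the sketch below runs train_test_split on hypothetical data with 10 rows (not the real dataset). With test_size=0.20, 20% of the rows are held out for testing, and a fixed random_state makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, one attribute column (2-D), one label per row (1-D)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 + 10.0 * X.ravel()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# 8 rows go to training and 2 rows to testing
print(X_train.shape, X_test.shape)
```

Note that X stays two-dimensional (rows by attributes) while y is one-dimensional, which matches what dataset.iloc[:, :-1].values and dataset.iloc[:, 1].values produce in Step 4.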

**Step 6: Create and fit the model**

model = LinearRegression()

model.fit(X_train, y_train)

**Step 7: Print intercept and coefficient (slope) of the best fit line**

print(model.intercept_)

print(model.coef_)

*I got an intercept of 2.018160041434683 and a coefficient of 9.91065648. This means that for every additional hour studied, the predicted score rises by about 9.91 percentage points. In simpler words, a student who studies one hour more for an exam can expect a score about 9.91 points higher.*

**Step 8: Predict from the model**

y_pred = model.predict(X_test)
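The fitted line can also be applied by hand, which shows what predict() computes for a single attribute: intercept plus slope times hours. The sketch below uses the intercept and coefficient reported above. The input of 9.25 hours and the variable names are hypothetical, chosen only for illustration.

```python
# Values reported in the article for the fitted line
intercept = 2.018160041434683
slope = 9.91065648

# Hypothetical input, not taken from the dataset
hours = 9.25

# Prediction for one student: score = intercept + slope * hours
predicted_score = intercept + slope * hours
print(round(predicted_score, 2))
```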

*y_pred is a numpy array that contains the predicted values for the inputs in X_test.*

*Let's see the difference between the actual and predicted values.*

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

df

**Step 9: Check the accuracy**

meanAbsoluteError = mean_absolute_error(y_test, y_pred)

meanSquaredError = mean_squared_error(y_test, y_pred)

rootMeanSquaredError = np.sqrt(meanSquaredError)

print('Mean Absolute Error:', meanAbsoluteError)

print('Mean Squared Error:', meanSquaredError)

print('Root Mean Squared Error:', rootMeanSquaredError)

*I got a Root Mean Squared Error of 4.6474476121003665, which is less than 10% of the mean score of all the students, i.e. 51.48 (see Step 2). This suggests that our algorithm did a decent job.*
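For readers who want to see what the three metrics measure, here is a minimal sketch that computes them directly with numpy. The actual and predicted values are hypothetical and only illustrate the arithmetic. The last line repeats the comparison made above, using the RMSE and mean score reported in this article.

```python
import numpy as np

# Hypothetical actual and predicted scores, for illustration only
y_true = np.array([20.0, 27.0, 69.0, 30.0, 62.0])
y_hat = np.array([17.0, 34.0, 75.0, 27.0, 60.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))   # Mean Absolute Error
mse = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)             # Root Mean Squared Error

# RMSE relative to the mean score, using the values reported in the article
rmse_to_mean = 4.6474476121003665 / 51.48

print(mae, mse, rmse, rmse_to_mean)
```

The ratio comes out a little above 0.09, which is the basis for the "less than 10%" statement.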