Let's implement the Decision Tree algorithm in Python using the scikit-learn library. In my last article, we solved a classification problem using a Decision Tree. This time, we will solve a regression problem (predicting petrol consumption in the US) using a Decision Tree.

We need to import **DecisionTreeRegressor** from the **sklearn** library instead of **DecisionTreeClassifier** to solve a regression problem with a Decision Tree.

To measure the performance of a regression model, we need to import the **mean_absolute_error** and **mean_squared_error** metrics instead of **confusion_matrix, accuracy_score and classification_report**, which we used in the classification problem.

You can download **petrol_consumption.csv** from here. You can also download my Jupyter notebook containing the Decision Tree implementation code shown below.
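To make the difference from classification concrete, here is a minimal sketch of what a regression tree returns. The numbers are made up for illustration and are not taken from the petrol dataset. With `max_depth=1` the tree makes a single split, and each prediction is the mean of the training targets that fall on that side of the split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up, clearly separated toy data (not the petrol dataset).
toy_X = np.array([[1], [2], [3], [10], [11], [12]])
toy_y = np.array([10.0, 12.0, 14.0, 50.0, 52.0, 54.0])

# A depth-1 tree (a "stump") can only split the data once.
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(toy_X, toy_y)

# Each prediction is the mean target of the leaf the input lands in:
# left leaf mean = (10 + 12 + 14) / 3, right leaf mean = (50 + 52 + 54) / 3.
toy_pred = stump.predict(np.array([[2], [11]]))
print(toy_pred)
```

The output is a continuous number per input rather than a class label, which is why error-based metrics replace the confusion matrix.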

**Step 1: Import the required Python libraries like pandas, numpy and sklearn**

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_absolute_error, mean_squared_error

**Step 2: Load and examine the dataset**

names = ['Index', 'One', 'Petrol tax (cents per gallon)', 'Average income (dollars)',

'Paved Highways (miles)', 'Proportion of population with driver licenses',

'Consumption of petrol (millions of gallons)']

dataset = pd.read_csv('petrol_consumption.csv', names=names)

dataset.shape

dataset.head()

dataset.describe()

*Please note that "describe()" displays summary statistics of the data, such as the mean and standard deviation.*

**Step 3: Define the features (X) and labels (y)**

X = dataset.drop(['Index', 'One', 'Consumption of petrol (millions of gallons)'], axis=1)

*Please note that the first two columns, "Index" and "One", are of no use for making any prediction, so these two features are dropped. The label, which is the last column, is dropped from X as well.*

*X contains the attributes (features).*

*y contains the labels.*

y = dataset['Consumption of petrol (millions of gallons)']

X.head()

y.head()
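As a quick check of the feature/label split, the sketch below applies the same idea to a tiny hypothetical frame. The column names follow the article, but the rows are invented and only one feature column is included to keep it short.

```python
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration.
label = 'Consumption of petrol (millions of gallons)'
toy = pd.DataFrame({
    'Index': [1, 2, 3],
    'One': [1, 1, 1],
    'Petrol tax (cents per gallon)': [9.0, 9.0, 7.5],
    label: [541, 524, 561],
})

# Drop the two unused columns and the label to get the feature matrix.
toy_X = toy.drop(columns=['Index', 'One', label])
# The label column on its own becomes the target vector.
toy_y = toy[label]

print(toy_X.columns.tolist())
print(toy_X.shape, toy_y.shape)
```

Checking the shapes this way confirms that X and y have the same number of rows and that the label has not leaked into the features.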

**Step 4: Split the dataset into training and testing dataset**

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
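The effect of `test_size=0.20` and `random_state=0` can be seen on synthetic data. This is only a sketch: the 50-row array below is randomly generated and its size is an assumption, not a property of the petrol dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 50 rows, 4 features (sizes chosen for illustration).
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(50, 4))
demo_y = rng.normal(size=50)

# 20% of the rows are held out for testing, the remaining 80% are used for training.
X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, test_size=0.20, random_state=0)
print(X_tr.shape, X_te.shape)

# Reusing the same random_state reproduces exactly the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(demo_X, demo_y, test_size=0.20, random_state=0)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```

Fixing `random_state` keeps the reported errors comparable between runs, since the tree is always evaluated on the same held-out rows.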

**Step 5: Create and fit the model**

model = DecisionTreeRegressor()

model.fit(X_train, y_train)
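One thing worth knowing about the default settings: an unconstrained `DecisionTreeRegressor` keeps splitting until it fits the training data almost perfectly, which can hurt performance on unseen data. The sketch below compares it with a depth-limited tree on synthetic data. The data and the choice of `max_depth=2` are assumptions made for illustration, not settings from this article.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic noisy linear data (illustrative only).
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 10, size=(40, 1))
demo_y = 3.0 * demo_X[:, 0] + rng.normal(scale=1.0, size=40)

# Default tree: grows until every training point is fitted.
full_tree = DecisionTreeRegressor(random_state=0).fit(demo_X, demo_y)
# Depth-limited tree: at most 2 levels of splits, so at most 4 leaves.
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(demo_X, demo_y)

full_train_mae = mean_absolute_error(demo_y, full_tree.predict(demo_X))
shallow_train_mae = mean_absolute_error(demo_y, shallow_tree.predict(demo_X))

print('Depths:', full_tree.get_depth(), shallow_tree.get_depth())
print('Training MAE:', full_train_mae, shallow_train_mae)
```

A training error near zero for the full tree is a sign of memorization rather than quality, so the test-set errors computed later are the numbers to trust.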

**Step 6: Predict from the model**

y_pred = model.predict(X_test)

*y_pred is a numpy array that contains the predicted values for the inputs in X_test.*

*Let's see the difference between the actual and predicted values.*

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

df

**Step 7: Check the accuracy**

meanAbsoluteError = mean_absolute_error(y_test, y_pred)

meanSquaredError = mean_squared_error(y_test, y_pred)

rootMeanSquaredError = np.sqrt(meanSquaredError)

print('Mean Absolute Error:', meanAbsoluteError)

print('Mean Squared Error:', meanSquaredError)

print('Root Mean Squared Error:', rootMeanSquaredError)
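To see what these three numbers mean, the sketch below computes them by hand with numpy on four made-up actual/predicted pairs and compares the result with the sklearn functions. The values are invented for illustration and are not results from the petrol dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration only.
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 330.0, 400.0])

errors = actual - predicted            # [-10, 10, -30, 0]

# MAE: average of the absolute errors -> (10 + 10 + 30 + 0) / 4
mae_manual = np.mean(np.abs(errors))
# MSE: average of the squared errors -> (100 + 100 + 900 + 0) / 4
mse_manual = np.mean(errors ** 2)
# RMSE: square root of the MSE, back in the units of the label
rmse_manual = np.sqrt(mse_manual)

print('MAE :', mae_manual, mean_absolute_error(actual, predicted))
print('MSE :', mse_manual, mean_squared_error(actual, predicted))
print('RMSE:', rmse_manual)
```

Because the errors are squared before averaging, the single 30-unit miss dominates the MSE and RMSE far more than it does the MAE. A useful rule of thumb is to compare the MAE or RMSE with the typical size of the label (for example its mean from `describe()`) to judge whether the error is large or small.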
