Monday, 25 March 2019

Hypothesis Generation: Null Hypothesis (Ho) vs Alternate Hypothesis (Ha) in Machine Learning

Hypothesis generation is the process of listing the features that could influence the target variable, each of which can later be tested at a chosen confidence level (conventionally 95%). Doing this before looking at the dataset helps avoid bias. This step often helps in creating new features as well.

Domain knowledge is very important in hypothesis generation. Before looking at the data, you should have an idea of which important features it ought to contain. Consider the Ames Housing dataset, where our aim is to predict house prices. What factors can you think of right now that could influence house prices? Write down your own factors as well, then we can match them against the features available in the original dataset.

Defining a hypothesis has two parts:

1. Null Hypothesis (Ho) 
2. Alternate Hypothesis (Ha)

Ho - There exists no impact of a particular feature on the dependent variable. 
Ha - There exists a direct impact of a particular feature on the dependent variable.

Based on a decision criterion (say, a 5% significance level, which corresponds to 95% confidence), we either 'reject' or 'fail to reject' the null hypothesis, in statistical parlance. In practice, while building a model, we look at the p-value of each feature: the probability of seeing an effect at least as strong as the observed one if the null hypothesis were true. If p < 0.05, we reject the null hypothesis. If p ≥ 0.05, we fail to reject it.
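To make the decision rule concrete, here is a minimal sketch of testing one hypothesis ("area of house influences price") at the 5% level. It is only an illustration under stated assumptions: the data are synthetic rather than taken from the Ames dataset, the helper names `pearson_r` and `permutation_p_value` are made up for this post, and a permutation test on the correlation is just one of several ways to obtain a p-value.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(feature, target, n_permutations=1000, seed=0):
    """Two-sided permutation p-value for Ho: feature has no impact on target."""
    rng = random.Random(seed)
    observed = abs(pearson_r(feature, target))
    shuffled = list(target)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling the target breaks any real link with the feature,
        # which simulates the world in which Ho is true.
        rng.shuffle(shuffled)
        if abs(pearson_r(feature, shuffled)) >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Synthetic data (an assumption, not the Ames dataset):
# price depends on area plus random noise.
rng = random.Random(42)
area = [rng.uniform(50, 300) for _ in range(200)]
price = [100 * a + rng.gauss(0, 1000) for a in area]

ALPHA = 0.05  # 5% significance level
p_value = permutation_p_value(area, price)
decision = "reject Ho" if p_value < ALPHA else "fail to reject Ho"
print(f"p = {p_value:.4f}: {decision}")
```

Because the synthetic price is built directly from the area, the test should reject Ho here. With a feature unrelated to price, the p-value would usually land above 0.05 and we would fail to reject Ho.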

Some factors which I can think of that directly influence house prices are the following:

Location of house
Area of house
Floors in the house
Age of house
Proximity to market, school, hospital, parks
Availability of public transport
Water / Electricity availability
Car parking
What material is used in the construction
If terrace is available
If security is available

In this way, you can think of many features before looking at the dataset. Based on my domain knowledge, I would expect the above features to be present in the dataset, since they should influence the sale prices of the houses.
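Once the list is written down, matching it against the dataset can be sketched as below. Everything here is an assumption for illustration: the column names follow the Kaggle release of the Ames data as I remember them, `dataset_columns` is a hand-typed subset rather than something read from the real file, and `match_hypotheses` is a hypothetical helper. A value of `None` simply means I did not guess a candidate column for that factor.

```python
# Hypothesised factor -> column that might capture it (guesses, not verified).
hypotheses = {
    "Location of house": "Neighborhood",
    "Area of house": "GrLivArea",
    "Floors in the house": "HouseStyle",
    "Age of house": "YearBuilt",
    "Car parking": "GarageCars",
    "Water / Electricity availability": "Utilities",
    "Construction material": "Exterior1st",
    "Availability of public transport": None,
    "Security": None,
}

# Illustrative subset of column names; in practice, take these from the
# header of the actual data file.
dataset_columns = {
    "Neighborhood", "GrLivArea", "HouseStyle", "YearBuilt",
    "GarageCars", "Utilities", "Exterior1st", "SalePrice",
}


def match_hypotheses(hypotheses, columns):
    """Split hypothesised factors into those with a matching column and those without."""
    matched = {f: c for f, c in hypotheses.items() if c in columns}
    unmatched = [f for f, c in hypotheses.items() if c not in columns]
    return matched, unmatched


matched, unmatched = match_hypotheses(hypotheses, dataset_columns)
print("Covered by the data:", matched)
print("Candidates for new features or extra data:", unmatched)
```

The unmatched factors are the interesting ones: they are either ideas for new features or a sign that extra data is needed.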
