Binning is a quantization technique used in Machine Learning to handle continuous variables, and it is an important step in Data Wrangling. Binning transforms a continuous variable into groups, ranges or intervals called bins.
For example, consider a dataset with a variable that stores people's ages. Age is a continuous variable that can range from 1 to 100+, so the raw values are difficult to analyze directly. With binning, we can convert the values of this variable into a small number of ranges.
Types of Binning
There are two types of binning techniques:
1. Fixed-Width Binning
2. Adaptive Binning
Let's discuss them one by one:
1. Fixed-Width Binning
In fixed-width binning, we manually define bins of a fixed width, based on rules or domain knowledge. Suppose the age column contains the following 15 values:
age = [12, 15, 13, 78, 65, 42, 98, 24, 26, 38, 27, 32, 22, 45, 27]
Now, let's create bins of fixed width (say 10):
bins = [0 {0-9}, 1 {10-19}, 2 {20-29}, 3 {30-39}, 4 {40-49}, 5 {50-59}, 6 {60-69}, 7 {70-79}, 8 {80-89}, 9 {90-99}]
After binning, our age variable looks like this:
age = [1, 1, 1, 7, 6, 4, 9, 2, 2, 3, 2, 3, 2, 4, 2]
In this way, all 15 values fit into the 10 ranges / bins above. The benefit grows with the size of the data: a dataset with thousands of values in the age column would still reduce to the same 10 bins.
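The mapping above can be reproduced with integer division. This is a minimal sketch using only the Python standard library; the names `ages`, `BIN_WIDTH` and `fixed_width_bin` are chosen here for illustration.

```python
# Ages from the example above.
ages = [12, 15, 13, 78, 65, 42, 98, 24, 26, 38, 27, 32, 22, 45, 27]

BIN_WIDTH = 10  # bin width, chosen manually


def fixed_width_bin(value, width=BIN_WIDTH):
    """Return the index of the fixed-width bin that contains value."""
    return value // width


binned_ages = [fixed_width_bin(a) for a in ages]
print(binned_ages)
# [1, 1, 1, 7, 6, 4, 9, 2, 2, 3, 2, 3, 2, 4, 2]
```

Note that an age of 100 or more would map to bin 10 or higher, outside the ten bins listed above, so real data may need an extra catch-all bin.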
2. Adaptive Binning
In Fixed-Width Binning, the bin ranges are decided manually, without looking at how the data is distributed. As a result, the bins are usually uneven in the number of data points that fall into each one: some bins may be densely populated, while others are sparsely populated or even empty.
For example, bins 0, 5 and 8 are empty in our case, while bin 2 holds 5 of the 15 values.
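The uneven occupancy can be checked by counting how many values land in each bin. The sketch below assumes the same 15 ages and a bin width of 10; the variable names are illustrative.

```python
from collections import Counter

ages = [12, 15, 13, 78, 65, 42, 98, 24, 26, 38, 27, 32, 22, 45, 27]

# Count the values per fixed-width bin (width 10, bins 0 to 9).
bin_counts = Counter(a // 10 for a in ages)
occupancy = [bin_counts.get(b, 0) for b in range(10)]
empty_bins = [b for b, count in enumerate(occupancy) if count == 0]

print(occupancy)   # [0, 3, 5, 2, 2, 0, 1, 1, 0, 1]
print(empty_bins)  # [0, 5, 8]
```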
In Adaptive Binning, the data distribution itself decides the bin ranges. Apart from choosing the number of bins, no manual intervention is required. The resulting bins are roughly uniform in the number of data points they contain.
Quantile-based binning is a good strategy for adaptive binning. Quantiles are cut-points that partition the distribution of a numeric field into discrete, contiguous bins or intervals. The q-Quantiles are the q - 1 cut-points that partition a numeric attribute into q bins, each holding approximately the same number of data points.
Popular examples of quantiles include the 2-Quantile, known as the median, which divides the data distribution into two equal bins; the 4-Quantiles, known as the quartiles, which divide the data into 4 equal bins; and the 10-Quantiles, known as the deciles, which create 10 bins of equal frequency. Note that these bins are equal in the number of data points, not in width.
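As a rough illustration, quartile-based binning of the same 15 ages can be sketched with the Python standard library. The cut-points come from `statistics.quantiles` with its default method (other tools may interpolate slightly differently), and using `bisect_left` to make each bin include its upper cut-point is an assumption made here, not the only possible convention.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

ages = [12, 15, 13, 78, 65, 42, 98, 24, 26, 38, 27, 32, 22, 45, 27]

Q = 4  # number of bins (quartiles)

# quantiles() returns the Q - 1 cut-points that separate the Q bins.
cut_points = quantiles(ages, n=Q)

# A value equal to a cut-point is placed in the lower bin.
quartile_bins = [bisect_left(cut_points, a) for a in ages]
counts = Counter(quartile_bins)

print(cut_points)              # [22.0, 27.0, 45.0]
print(quartile_bins)           # [0, 0, 0, 3, 3, 2, 3, 1, 1, 2, 1, 2, 0, 2, 1]
print(sorted(counts.items()))  # [(0, 4), (1, 4), (2, 4), (3, 3)]
```

Unlike the fixed-width bins, no bin is empty here: the 15 values are spread over the four bins as evenly as the data allows, while the bin widths differ.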
Advantage of Binning: It reduces a continuous variable to a small set of groups, which makes patterns in the data easier to analyze and interpret.
Disadvantage of Binning: Binning leads to loss of information. Once the original values are replaced by bin labels, values that fall in the same bin can no longer be told apart.