Tuesday, 12 March 2019

What is t-SNE? How does it work using t-Distribution?

t-SNE stands for t-Distributed Stochastic Neighbor Embedding. It is a non-linear dimensionality reduction algorithm. t-SNE uses normal distribution (in higher dimension) and t-Distribution (in lower dimension) to reduce the dimensions of the dataset. We will see it in detail. 

As per documentation:

“t-Distributed stochastic neighbor embedding (t-SNE) minimizes the divergence between two distributions: a distribution that measures pairwise similarities of the input objects and a distribution that measures pairwise similarities of the corresponding low-dimensional points in the embedding”.

t-SNE is a technique to convert high-dimensional data into lower dimensional data while keeping the relative similarity of the data points as close to the original (in high dimensional space) as possible.

Lets see how does t-SNE work? I will just illustrate a high level understanding of the algorithm (because even I don't understand the complex mathematics behind it :)).

Higher Dimension

Step 1: Picks a data point (say P1), calculates its Euclidean distance from a neighboring data point (say P2) and converts this distance into the conditional probability using normal distribution. 

Conditional probability represents the similarity between the pairs of data points.

Step 2: Again it calculates the distance of P1 from other neighboring point (say P3) and does the same as in Step 1.

It keeps doing same thing for all the data points. In this way t-SNE keeps computing pairwise conditional probabilities for each data point using normal distribution in higher dimension.

To summarize, t-SNE measures the similarity between each and every pair of data points. Similar data points will have more value of similarity and the different data points will have less value. Then it converts that similarity distance to the conditional probability according to the normal distribution. It also creates a similarity matrix (say S1).

Lower Dimension

Step 3: t-SNE arranges all of the data points randomly on the required lower dimension (say two dimensional space).

Step 4: It again does the same calculations for all the data points in the lower dimension as it did for all the data points in the higher dimension in Step 1 and Step 2. The only difference is that it uses t-Distribution in this case instead of normal distribution. That is why it is called t-SNE instead of simple SNE.

It also creates a similarity matrix in lower dimension (say S2).

Now t-SNE compares the similarity matrix S1 and S2 and tries to minimize the difference such that all the pairs have a similar probability distribution (both in higher and lower dimension). Gradient Descent is used with Kullback Leibler Divergence between the two distributions as a cost function. To measure the minimization of sum of difference of conditional probability, SNE minimizes the sum of Kullback-Leibler divergences of overall data points using a Gradient Descent method.

Difference between normal distribution and t-Distribution

t-Distribution is a lot like a normal distribution. The only difference is that t-distribution is not as tall as normal distribution in middle but its tails are taller at the ends. 

Why is t-Distribution used instead of normal distribution in lower dimension because without it the clusters would all clump up in the middle and will be harder to visualize.

Related:

Advantages and Disadvantages of t-SNE over PCA (PCA vs t-SNE)
Advantages and Disadvantages of Principal Component Analysis
Dimensionality Reduction: Feature Selection and Feature Extraction
Why is Dimensionality Reduction required in Machine Learning?

No comments:

Post a Comment