Kullback-Leibler Divergence

Saw this term here: https://spinningup.openai.com/en/latest/algorithms/trpo.html

This seems to be widely used as a way to measure the distance between two distributions in the Machine Learning world. However, there seems to be recent interest in computing the Sinkhorn Divergence instead.

Interestingly, they also talk about this idea in F1TENTH for the Particle Filter.

Also heard about it while learning about EMD (Earth Mover's Distance).

Notation from here. The Kullback-Leibler divergence from the distribution $Q$ to the distribution $P$ is defined as

$$D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx,$$

where $p$ and $q$ are the respective densities of $P$ and $Q$.
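To make the definition concrete, here is a minimal sketch for the discrete case, where the integral becomes a sum. The distributions `p` and `q` below are made-up values for illustration:

```python
import numpy as np
from scipy.stats import entropy

# Two discrete distributions over the same support (hypothetical values).
p = np.array([0.4, 0.4, 0.2])
q = np.array([0.3, 0.3, 0.4])

# Manual computation: D_KL(P || Q) = sum_x p(x) * log(p(x) / q(x))
kl_manual = np.sum(p * np.log(p / q))

# scipy.stats.entropy returns the KL divergence when given two arguments.
kl_scipy = entropy(p, q)

print(kl_manual, kl_scipy)  # both ~0.0915
```

Note that $D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P)$ in general, which is one reason it is a divergence rather than a true distance.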

https://dfdazac.github.io/sinkhorn.html: "It can be shown that minimizing the KL divergence is equivalent to minimizing the Negative Log Likelihood, which is what we usually do when training a classifier, for example. In the case of the Variational Autoencoder, we want the approximate posterior to be close to some prior distribution, which we achieve, again, by minimizing the KL divergence between them."

“often dubbed Cross-Entropy Loss in the Deep Learning context” from here
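The link to cross-entropy comes from the decomposition $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$: since $H(p)$ is fixed by the data, minimizing cross-entropy over the model $q$ is the same as minimizing the KL divergence. A quick numerical check of this identity (same hypothetical distributions as above):

```python
import numpy as np

p = np.array([0.4, 0.4, 0.2])  # "true" distribution (hypothetical values)
q = np.array([0.3, 0.3, 0.4])  # model's predicted distribution

cross_entropy = -np.sum(p * np.log(q))  # H(p, q)
entropy_p = -np.sum(p * np.log(p))      # H(p)
kl = np.sum(p * np.log(p / q))          # D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q); H(p) does not depend on q,
# so minimizing cross-entropy in q also minimizes the KL divergence.
assert np.isclose(cross_entropy, entropy_p + kl)
```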

It can be shown that solving the Maximum Likelihood Estimation problem is equivalent to minimizing the Kullback-Leibler divergence between the data distribution and the model distribution.
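A sketch of why, writing $p_{\text{data}}$ for the data distribution and $p_\theta$ for the model (my notation, not from the links above):

$$D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{data}}(x)\big] - \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big].$$

The first term does not depend on $\theta$, so

$$\arg\min_\theta D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta) = \arg\max_\theta \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big] \approx \arg\max_\theta \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i),$$

which is exactly Maximum Likelihood Estimation on samples $x_1, \dots, x_N$.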