Saw this term here: https://spinningup.openai.com/en/latest/algorithms/trpo.html
It is very widely used in the machine learning world as a way to measure the distance between two distributions. However, there seems to be recent interest in computing the Sinkhorn Divergence instead.
Also hearing about it while learning about EMD.
Notation from here. The Kullback-Leibler divergence from the distribution $Q$ to the distribution $P$ is defined as

$$D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$

where $p$ and $q$ are the respective densities of $P$ and $Q$.
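A quick numerical sketch of the definition above for the discrete case (sum instead of integral), using NumPy; the distributions `p` and `q` here are made-up examples:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) = sum_x p(x) log(p(x)/q(x))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Terms where p(x) = 0 contribute 0, so only sum over the support of p.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.15, 0.05])

print(kl_divergence(p, q))  # positive, since p differs from q
print(kl_divergence(p, p))  # 0: zero divergence from itself
print(kl_divergence(q, p))  # a different value: KL is not symmetric
```

Note the asymmetry: $D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P)$ in general, which is one reason it is a divergence rather than a true distance (and part of the motivation for alternatives like the Sinkhorn Divergence).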
https://dfdazac.github.io/sinkhorn.html “It can be shown that minimizing the KL divergence is equivalent to minimizing the Negative Log Likelihood, which is what we usually do when training a classifier, for example. In the case of the Variational Autoencoder, we want the approximate posterior to be close to some prior distribution, which we achieve, again, by minimizing the KL divergence between them.”
It can be shown that solving the Maximum Likelihood Estimation problem is equivalent to minimizing the Kullback-Leibler divergence between the empirical distribution of the data and the model distribution.
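A small sketch of why this equivalence holds, with made-up toy data: the average negative log-likelihood decomposes as the entropy of the empirical distribution (a constant) plus the KL divergence from the model to the empirical distribution, so minimizing one minimizes the other:

```python
import numpy as np

# Hypothetical toy data: samples from a 3-outcome categorical variable.
samples = np.array([0, 1, 1, 2, 2, 2, 2, 2, 1, 0])
K = 3

# Empirical distribution of the data.
p_emp = np.bincount(samples, minlength=K) / len(samples)

def avg_nll(q):
    """Average negative log-likelihood of the samples under model q."""
    return float(-np.mean(np.log(q[samples])))

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def entropy(p):
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

q = np.array([0.5, 0.3, 0.2])  # some candidate model distribution

# avg NLL = H(p_emp) + D_KL(p_emp || q); the entropy term does not
# depend on q, so minimizing the NLL over q minimizes the KL divergence.
print(avg_nll(q))
print(entropy(p_emp) + kl(p_emp, q))  # same value
```

Setting `q = p_emp` drives the KL term to zero, which is exactly the maximum-likelihood solution for a categorical model.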