Saw this term here: https://spinningup.openai.com/en/latest/algorithms/trpo.html
It is very widely used in the machine learning world as a way to measure the distance between two distributions. However, there seems to be recent interest in computing the Sinkhorn Divergence instead.
Also hearing about it while learning about EMD.
Notation from here. The Kullback-Leibler divergence from the distribution $Q$ to the distribution $P$ is defined as

$$D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$

where $p$ and $q$ are the respective densities of $P$ and $Q$.
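A quick numerical sketch of the definition above for the discrete case (sum instead of integral), using NumPy; the distributions `p` and `q` here are made-up examples:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) = sum_x p(x) log(p(x)/q(x))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Terms where p(x) = 0 contribute 0, so only sum over the support of p.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.15, 0.05])

print(kl_divergence(p, q))  # positive, since p differs from q
print(kl_divergence(p, p))  # 0: zero divergence from itself
print(kl_divergence(q, p))  # a different value: KL is not symmetric
```

Note the asymmetry: $D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P)$ in general, which is one reason it is a divergence rather than a true distance (and part of the motivation for alternatives like the Sinkhorn Divergence).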
https://dfdazac.github.io/sinkhorn.html “It can be shown that minimizing the KL divergence is equivalent to minimizing the Negative Log Likelihood, which is what we usually do when training a classifier, for example. In the case of the Variational Autoencoder, we want the approximate posterior to be close to some prior distribution, which we achieve, again, by minimizing the KL divergence between them.”
It can be shown that solving the Maximum Likelihood Estimation problem is equivalent to minimizing the Kullback-Leibler divergence between the empirical distribution of the data and the model distribution.
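A small sketch of why this equivalence holds, with made-up toy data: the average negative log-likelihood decomposes as the entropy of the empirical distribution (a constant) plus the KL divergence from the model to the empirical distribution, so minimizing one minimizes the other:

```python
import numpy as np

# Hypothetical toy data: samples from a 3-outcome categorical variable.
samples = np.array([0, 1, 1, 2, 2, 2, 2, 2, 1, 0])
K = 3

# Empirical distribution of the data.
p_emp = np.bincount(samples, minlength=K) / len(samples)

def avg_nll(q):
    """Average negative log-likelihood of the samples under model q."""
    return float(-np.mean(np.log(q[samples])))

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def entropy(p):
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

q = np.array([0.5, 0.3, 0.2])  # some candidate model distribution

# avg NLL = H(p_emp) + D_KL(p_emp || q); the entropy term does not
# depend on q, so minimizing the NLL over q minimizes the KL divergence.
print(avg_nll(q))
print(entropy(p_emp) + kl(p_emp, q))  # same value
```

Setting `q = p_emp` drives the KL term to zero, which is exactly the maximum-likelihood solution for a categorical model.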