Normalizing Flow

A normalizing flow transforms a simple distribution (generally a Unit Gaussian) into a complex, data-like distribution using a sequence of invertible and differentiable functions.

This is an idea that I saw Billy Zheng write about https://hongruizheng.com/2020/03/13/normalizing-flow.html

Resources

https://lilianweng.github.io/posts/2018-10-13-flow-models/

A normalizing flow learns an invertible transformation $f$ between data and latent variables: $x = f(z)$, $z = f^{-1}(x)$

$x$ is a data sample
$z$ is a latent variable sampled from a simple distribution

You can't just learn a mapping from $x$ to $z$ and a separate mapping back, trained to reconstruct $x$: that would basically be an Autoencoder.

The function $f$ in normalizing flows is exactly invertible. In normalizing flows, we care about density estimation, not reconstruction. The loss is based on the log-likelihood of data $x$ under the model (more below).

We can write this in terms of probability density functions (this comes from the Change-of-Variable Formula in probability): for $x = f(z)$, $p_{X}(x) = p_{Z}(z) \left| \det \frac{\partial f^{-1}}{\partial x} \right|$

I don't understand where $\det \left( \frac{\partial f^{-1}}{\partial x} \right)$ comes from?
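One way to see where the determinant comes from, sketched for the 1D case as a reasoning aid rather than a full proof: probability mass has to be conserved. The probability that $x$ lands in a tiny interval of width $|dx|$ must equal the probability that $z$ lands in the matching interval of width $|dz|$:

$$p_{X}(x) \, |dx| = p_{Z}(z) \, |dz| \quad \Longrightarrow \quad p_{X}(x) = p_{Z}(z) \left| \frac{dz}{dx} \right| = p_{Z}(f^{-1}(x)) \left| \frac{d f^{-1}}{dx} \right|$$

In more than one dimension, the interval becomes a small volume, and the factor by which $f^{-1}$ stretches or shrinks that volume is the absolute value of the determinant of its Jacobian. That is the $\left| \det \frac{\partial f^{-1}}{\partial x} \right|$ term.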

In training, data flows from $x \to z$: take data $x$ and apply the inverse flow $z = f^{-1}(x)$. We minimize the loss over the parameters of $f^{-1}$. Compute the Log Likelihood using the change-of-variable formula: $\log p_{X}(x) = \log p_{Z}(z) + \log \left| \det \frac{\partial f^{-1}}{\partial x} \right|$
- Notice the magic: for sampling, we can then use $f$! All thanks to the fact that $f$ is invertible.

In sampling / generation, data flows from $z \to x$: sample $z \sim N(0, I)$ from the base distribution and apply the forward flow $x = f(z)$.
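To make the two directions concrete, here is a minimal sketch using a toy 1D flow. The affine map $x = a z + b$ and the constants `A` and `B` are assumptions chosen for illustration, not something a real model would hard-code.

```python
import math
import random

# Hypothetical toy flow: x = f(z) = A * z + B, with A != 0 so f is invertible.
A, B = 2.0, 1.0

def forward(z):
    """Sampling direction, z -> x."""
    return A * z + B

def inverse(x):
    """Training direction, x -> z."""
    return (x - B) / A

def log_standard_normal(z):
    """Log-density of N(0, 1) at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def log_likelihood(x):
    """log p_X(x) = log p_Z(z) + log |d f^{-1} / dx|, with z = f^{-1}(x)."""
    z = inverse(x)
    log_abs_det = -math.log(abs(A))  # d f^{-1} / dx = 1 / A
    return log_standard_normal(z) + log_abs_det

# Sampling: draw z from the base distribution and push it through f.
rng = random.Random(0)
samples = [forward(rng.gauss(0.0, 1.0)) for _ in range(5)]
```

For this toy flow, $x$ is distributed as $N(b, a^2)$, so `log_likelihood` can be checked against the closed-form Gaussian log-density.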

Difference with VAE?

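It’s how we formulate the loss: a normalizing flow maximizes the exact log-likelihood, while a VAE maximizes a lower bound on it (the ELBO). In a VAE, the encoder and decoder are also separate networks, whereas a flow uses one invertible function for both directions. Also: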

Normalizing flow is deterministic: No randomness is added when transforming $x \leftrightarrow z$ . It’s exact, we know what happens.

VAE is stochastic: it uses a stochastic encoder that samples $z$ from a learned distribution $q (z ∣ x)$ . It outputs a distribution.

So it seems that they both use a Gaussian latent distribution:

In flows, the Gaussian is transformed through exact, invertible functions to match the data.

In VAEs, the model learns to approximate the mapping between data and latent Gaussian through separate encoder/decoder networks.

Sampling:

We apply a chain of invertible transformations:

$$x = f_{K} \circ f_{K-1} \circ \dots \circ f_{1}(z_{0})$$

$z_{0} \sim N (0, I)$ : latent variable sampled from a standard Gaussian
Each $f_{i}$ : an invertible transformation (e.g., affine coupling layer)
Output $x$ : a realistic-looking data sample
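The chain above can be sketched as a list of layers applied in order, with the log-determinants of the individual layers summed along the way. The scalar layers here (`make_affine`, `make_sinh`) are hypothetical stand-ins chosen because their inverses and derivatives are easy to write down; real models use learned layers.

```python
import math

# Each hypothetical layer is a tuple (forward, inverse, log_abs_det_forward).
def make_affine(a, b):
    return (lambda z: a * z + b,
            lambda x: (x - b) / a,
            lambda z: math.log(abs(a)))

def make_sinh():
    # x = sinh(z) is invertible on the real line, and dx/dz = cosh(z) > 0.
    return (math.sinh, math.asinh, lambda z: math.log(math.cosh(z)))

layers = [make_affine(2.0, 1.0), make_sinh(), make_affine(0.5, -3.0)]

def flow_forward(z0):
    """Apply f_1, then f_2, ..., then f_K; return x and the summed log|det|."""
    z, total_log_det = z0, 0.0
    for fwd, _, log_det in layers:
        total_log_det += log_det(z)  # evaluated at this layer's input
        z = fwd(z)
    return z, total_log_det

def flow_inverse(x):
    """Undo the layers in reverse order: f_K^{-1} first, f_1^{-1} last."""
    for _, inv, _ in reversed(layers):
        x = inv(x)
    return x
```

Because the Jacobian of a composition is the product of the layer Jacobians, the log-determinants simply add, which is why each layer only needs to report its own term.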

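What does $f$ look like? Depends on the model. They’re generally built from Affine Transforms (such as affine coupling layers), since those are differentiable, easy to invert, and have a Jacobian determinant that is cheap to compute.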

Forward pass:

$$y_{a} = x_{a}, \qquad y_{b} = x_{b} \odot \exp(s_{\theta}(x_{a})) + t_{\theta}(x_{a})$$

Inverse pass:

$$x_{a} = y_{a}, \qquad x_{b} = (y_{b} - t_{\theta}(y_{a})) \odot \exp(-s_{\theta}(y_{a}))$$

$s_{θ}$ and $t_{θ}$ are neural networks (often small CNNs or MLPs).
Same parameters are reused in both directions.
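A minimal sketch of the forward and inverse passes above, for a 2D input split into scalars $x_{a}$ and $x_{b}$. The functions `s_theta` and `t_theta` are hypothetical fixed stand-ins for the neural networks; they only need to be evaluated, never inverted.

```python
import math

# Stand-ins for the neural nets s_theta and t_theta (hypothetical, fixed weights).
def s_theta(x_a):
    return math.tanh(0.5 * x_a)

def t_theta(x_a):
    return 0.3 * x_a + 0.1

def coupling_forward(x_a, x_b):
    """y_a = x_a, y_b = x_b * exp(s(x_a)) + t(x_a). Also returns log|det J|."""
    s = s_theta(x_a)
    y_a = x_a
    y_b = x_b * math.exp(s) + t_theta(x_a)
    # The Jacobian is triangular with diagonal (1, exp(s)), so log|det J| = s.
    return y_a, y_b, s

def coupling_inverse(y_a, y_b):
    """x_a = y_a, x_b = (y_b - t(y_a)) * exp(-s(y_a))."""
    x_a = y_a
    x_b = (y_b - t_theta(y_a)) * math.exp(-s_theta(y_a))
    return x_a, x_b
```

Since $y_{a} = x_{a}$ passes through unchanged, the inverse can recompute exactly the same $s_{\theta}$ and $t_{\theta}$ values, which is what makes the layer invertible no matter how complicated those networks are.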

How Weight Updates Work in Flow-Based Models

Training is done via maximum likelihood estimation (MLE) using the change-of-variables formula.

Change of Variables

Given $x = f (z)$ and $z \sim N (0, I)$ :

$$\log p_{X}(x) = \log p_{Z}(f^{-1}(x)) + \log \left| \det \frac{\partial f^{-1}}{\partial x} \right|$$

Rewriting using forward Jacobian:

$$\log p_{X}(x) = \log p_{Z}(z) - \log \left| \det \frac{\partial f}{\partial z} \right|$$
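The sign flip between the two forms follows from the inverse function theorem: the Jacobian of $f^{-1}$ at $x$ is the matrix inverse of the Jacobian of $f$ at $z = f^{-1}(x)$, and the determinant of an inverse matrix is the reciprocal of the determinant:

$$\frac{\partial f^{-1}}{\partial x} = \left( \frac{\partial f}{\partial z} \right)^{-1} \quad \Longrightarrow \quad \log \left| \det \frac{\partial f^{-1}}{\partial x} \right| = - \log \left| \det \frac{\partial f}{\partial z} \right|$$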

The fundamental difference

This loss uses a Jacobian; it’s derived from the change-of-variables formula.

We’re not just doing $L = x - f(z)$ (a reconstruction error).

Training Steps

Inverse pass: Given data $x$ , compute $z = f^{- 1} (x)$
Compute log-likelihood loss:

$$L = - \log p_{Z}(z) + \log \left| \det \frac{\partial f}{\partial z} \right|$$

Backpropagate through:
- the inverse transformations $f^{- 1}$
- the neural nets $s_{θ}$ and $t_{θ}$
- the log-determinant term (structured for easy computation in RealNVP/Glow)
Gradient Descent:
- Use Adam/SGD to update parameters $θ$ in $s_{θ}$ and $t_{θ}$
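The training steps above can be sketched end to end on a toy problem. Everything here is an assumption for illustration: the flow is a 1D affine map $x = e^{s} z + b$ with two learnable scalars `s` and `b`, the dataset is synthetic, and the gradients are derived by hand and applied with plain gradient descent in place of autodiff and Adam.

```python
import math
import random

rng = random.Random(0)
data = [rng.gauss(3.0, 0.5) for _ in range(500)]  # synthetic toy dataset

# Hypothetical 1D flow: x = f(z) = exp(s) * z + b, with learnable s and b.
s, b = 0.0, 0.0
lr = 0.1

def nll(s, b):
    """Mean of L = -log p_Z(z) + log|det df/dz| over the dataset."""
    total = 0.0
    for x in data:
        z = (x - b) * math.exp(-s)  # inverse pass
        total += 0.5 * z * z + 0.5 * math.log(2.0 * math.pi) + s
    return total / len(data)

initial_loss = nll(s, b)
for step in range(1000):
    grad_s, grad_b = 0.0, 0.0
    for x in data:
        z = (x - b) * math.exp(-s)    # inverse pass
        grad_s += 1.0 - z * z         # dL/ds, using dz/ds = -z
        grad_b += -z * math.exp(-s)   # dL/db, using dz/db = -exp(-s)
    s -= lr * grad_s / len(data)      # gradient descent update
    b -= lr * grad_b / len(data)
final_loss = nll(s, b)
```

For this simple flow the maximum likelihood solution is known in closed form: `b` should approach the sample mean and `math.exp(s)` the sample standard deviation, which gives a convenient sanity check.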

Normalizing flow gives you an explicit representation of density functions.

Kernel Density Estimation

Used a lot in Generative Model.

Flow Matching

🛠️ Steven Gong
