Generative Model

Generative Adversarial Network (GAN)

A GAN trains a generator and discriminator in a minimax game so the generator learns to produce samples indistinguishable from real data.

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

where:

  • $G$ is the generator mapping noise $z$ to a sample $G(z)$
  • $D$ is the discriminator estimating the probability that a sample is real

Intuition

A two-player game between a forger and a detective. The generator forges samples from noise, the discriminator tries to tell forgeries from real data, and both improve by playing each other. At Nash equilibrium the forgeries are indistinguishable, i.e. $p_g = p_{\text{data}}$. Notice there’s no explicit density to write down: we never compute $p_g(x)$, we just need to sample from it, which is exactly what $G$ does.

To Steven: you’re probably looking for A Neural Representation of Sketch Drawings and https://quickdraw.withgoogle.com/.


GANs are composed of two Neural Nets:

  • The generator network creates new synthetic images trying to fool the discriminator
  • The discriminator network learns to tell a fake, synthesized image from a real one

The competition between both networks allows them to improve, until the generator becomes so good that fake samples cannot be distinguished from real ones.

Examples: DCGAN, StyleGAN, CycleGAN

Mathematical Formalization

Formally, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution.

$G$ and $D$ play the following two-player minimax game with value function $V(D, G)$:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where:

  • $G$ and $D$ are differentiable functions, each represented by an MLP
  • $D(x)$ is the probability that $x$ came from the data rather than from $p_g$
  • $D$ is trained to maximize the probability of assigning the correct label to both training examples and samples from $G$
  • $G$ takes $z$ as input and outputs an image, and is trained to minimize $\log(1 - D(G(z)))$
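
The value function can be estimated by Monte Carlo from samples alone. A small numpy sketch (the function names and toy distributions are mine): a maximally confused discriminator that outputs $1/2$ everywhere scores exactly $2\log\frac{1}{2} = -2\log 2 \approx -1.386$, the equilibrium value, while any discriminator that exploits a gap between $p_{\text{data}}$ and $p_g$ scores higher.

```python
import numpy as np

rng = np.random.default_rng(0)

def value_function(D, real, fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    return np.mean(np.log(D(real))) + np.mean(np.log(1.0 - D(fake)))

real = rng.normal(0.0, 1.0, size=10_000)   # samples from p_data
fake = rng.normal(3.0, 1.0, size=10_000)   # samples from p_g

# Maximally confused discriminator: D(x) = 1/2 everywhere.
confused = lambda x: np.full_like(x, 0.5)
print(value_function(confused, real, fake))   # -2 log 2 ≈ -1.386

# A discriminator whose decision boundary sits between the two means does better:
sharp = lambda x: 1.0 / (1.0 + np.exp(4.0 * (x - 1.5)))
print(value_function(sharp, real, fake))      # closer to 0 than -1.386
```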

With a GAN, given enough iterations, $p_g$ converges to $p_{\text{data}}$. In other words, from a random vector $z$, the network can synthesize an image that resembles one drawn from the true distribution $p_{\text{data}}$.

The discriminator is a traditional CNN classifier. The generator runs in the opposite direction: it takes random noise as input and upsamples it into an image.

Wasserstein GAN

Wasserstein Metric GAN: https://www.youtube.com/watch?v=xs9uibPODGk&ab_channel=Oneworldtheoreticalmachinelearning

Wasserstein GAN Paper: https://arxiv.org/pdf/1701.07875.pdf

Walkthrough (CS231n 2025 Lec 14)

Setup

GAN is the implicit-direct branch of the Goodfellow taxonomy: no explicit density $p_\theta(x)$, just sampling. Sample $z \sim p(z)$ (e.g. $\mathcal{N}(0, I)$), pass it through the generator $G$, and get $x = G(z)$ from the implicit distribution $p_g$. The goal is $p_g = p_{\text{data}}$, but we have no way to compute $p_g(x)$, so we can’t do MLE. Trick: train a discriminator $D$ to classify real vs fake, then use it as the loss for $G$.
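
The "implicit distribution" point is easy to make concrete: even a one-line generator defines a distribution you can sample from without ever writing down its density. In this toy sketch (my own example, not from the lecture), $G(z) = 2z + 3$ applied to $\mathcal{N}(0, 1)$ noise yields samples from $\mathcal{N}(3, 4)$, but the code never touches that density:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generator: a deterministic map applied to noise. Here G(z) = 2z + 3,
# so the implicit distribution p_g is N(3, 2^2) -- but we never write it down.
G = lambda z: 2.0 * z + 3.0

z = rng.standard_normal(100_000)   # z ~ p(z) = N(0, 1)
x = G(z)                           # x ~ p_g, defined only through sampling

print(x.mean(), x.std())           # ≈ 3.0, ≈ 2.0
```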

Minimax objective

  • Inner max: $D$ wants to push $D(x)$ toward 1 for real samples and toward 0 for fake samples
  • Outer min: $G$ wants $D(G(z)) \to 1$ (fool the discriminator)

Trained via alternating gradient updates: gradient ascent on $D$, then gradient descent on $G$.
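
A minimal sketch of the alternating updates on a 1-D toy problem. Every choice here is mine, not from the lecture: real data $\mathcal{N}(4, 1)$, a linear generator $G(z) = \theta_0 + \theta_1 z$, a logistic discriminator $D(x) = \sigma(wx + b)$, hand-derived gradients, and the non-saturating generator update (ascent on $\mathbb{E}[\log D(G(z))]$):

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

t0, t1 = 0.0, 1.0      # generator params: G(z) = t0 + t1*z
w, b = 0.0, 0.0        # discriminator params: D(x) = sigmoid(w*x + b)
lr, batch = 0.05, 64

for step in range(2000):
    real = rng.normal(4.0, 1.0, batch)
    z = rng.standard_normal(batch)
    fake = t0 + t1 * z

    # --- D step: gradient ASCENT on E[log D(real)] + E[log(1 - D(fake))] ---
    d_real, d_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
    gw = np.mean((1 - d_real) * real) + np.mean(-d_fake * fake)
    gb = np.mean(1 - d_real) + np.mean(-d_fake)
    w, b = w + lr * gw, b + lr * gb

    # --- G step: gradient ASCENT on the non-saturating E[log D(G(z))] ---
    d_fake = sigmoid(w * fake + b)
    gt0 = np.mean((1 - d_fake) * w)        # d/dt0 log D(G(z))
    gt1 = np.mean((1 - d_fake) * w * z)    # d/dt1 log D(G(z))
    t0, t1 = t0 + lr * gt0, t1 + lr * gt1

print(t0)   # generator mean drifts toward the data mean (4.0)
```

Even on this toy problem, the generator mean chases the data mean while the discriminator keeps re-adjusting; there is no single loss that decreases monotonically, which previews the monitoring problem below.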

No single loss curve to monitor

You’re chasing a saddle point, not minimizing a single function, so loss values on their own don’t tell you whether training is working.

Saturation problem & non-saturating fix

At the start of training, $G$ is bad, so $D(G(z)) \approx 0$ and $\log(1 - D(G(z)))$ is flat there, with near-zero gradient. $G$ can’t learn from a vanishing signal.

Fix (Goodfellow 2014): instead of minimizing $\log(1 - D(G(z)))$, train $G$ to maximize $\log D(G(z))$. Same direction (fool $D$), but the gradient is large precisely when $D$ is winning. Now standard.
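
The fix is visible numerically. Treating the generator objective as a function of $d = D(G(z))$: the saturating objective has gradient magnitude $\frac{1}{1-d}$, which is $\approx 1$ when $D$ is winning ($d \to 0$), while the non-saturating objective has gradient magnitude $\frac{1}{d}$, which blows up exactly there:

```python
import numpy as np

d = np.array([0.001, 0.01, 0.1, 0.5])   # D(G(z)): how "real" D thinks the fakes are

# Gradient magnitude w.r.t. d = D(G(z)) of each generator objective:
saturating = 1.0 / (1.0 - d)       # |d/dd log(1 - d)|  (original min objective)
non_saturating = 1.0 / d           # |d/dd log d|       (non-saturating objective)

for di, gs, gn in zip(d, saturating, non_saturating):
    print(f"D(G(z))={di}: saturating grad {gs:.2f}, non-saturating grad {gn:.2f}")
# When D is winning (d -> 0), the saturating gradient is ~1 while the
# non-saturating gradient is huge -- G keeps getting a learning signal.
```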

Optimal discriminator

For fixed $G$, the optimal discriminator has a closed form:

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$

Plugging $D^*$ back into the outer min, the global optimum is $p_g = p_{\text{data}}$. If $D$ is optimal at every step, $p_g$ converges to the data distribution. Caveats in practice: finite-capacity nets, no convergence guarantee from alternating SGD, mode collapse.
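
Sketch of the derivation (following Goodfellow et al. 2014). For fixed $G$, write the value function as an integral over $x$:

$$V(D, G) = \int_x p_{\text{data}}(x)\,\log D(x) + p_g(x)\,\log\bigl(1 - D(x)\bigr)\,dx$$

Pointwise, $a \log y + b \log(1 - y)$ is maximized at $y = \frac{a}{a+b}$, which gives the closed form for $D^*$. Substituting it back:

$$C(G) = \max_D V(D, G) = -\log 4 + 2 \cdot \mathrm{JSD}\!\left(p_{\text{data}} \,\|\, p_g\right)$$

Since the Jensen–Shannon divergence is $\geq 0$ with equality iff the distributions match, $C(G)$ is minimized at $-\log 4$ exactly when $p_g = p_{\text{data}}$.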

Read $D^*$ as a likelihood ratio: at points where real data is more likely than fake, $D^*(x) > \tfrac{1}{2}$; where fake is more likely, $D^*(x) < \tfrac{1}{2}$. The only way for $D^*$ to be exactly $\tfrac{1}{2}$ everywhere (maximally confused detective) is $p_g = p_{\text{data}}$.
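
A quick numerical check of the likelihood-ratio reading, with assumed toy densities of my choosing ($p_{\text{data}} = \mathcal{N}(0,1)$, $p_g = \mathcal{N}(2,1)$, which cross at $x = 1$):

```python
import numpy as np

def gauss(x, mu, sd):
    """Gaussian pdf."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

p_data = lambda x: gauss(x, 0.0, 1.0)   # real density
p_g    = lambda x: gauss(x, 2.0, 1.0)   # fake density; equal to p_data at x = 1

D_star = lambda x: p_data(x) / (p_data(x) + p_g(x))

print(D_star(-1.0))  # real far more likely here -> close to 1
print(D_star(1.0))   # densities equal -> exactly 0.5
print(D_star(3.0))   # fake far more likely -> close to 0
```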

DC-GAN (Radford et al. ICLR 2016)

First architecture to make GANs work on non-toy data: strided/transposed convolutions, BatchNorm, no fully-connected layers, ReLU/LeakyReLU. Aside: Alec Radford later did GPT-1 and GPT-2.

StyleGAN (Karras et al. CVPR 2019)

Two-stage generator: a mapping network produces a style vector $w$ from the latent $z$, and a synthesis network generates the image, conditioning every layer on $w$ via AdaIN (Adaptive Instance Norm):

$$\text{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$

Per-layer style injection lets you control coarse vs fine attributes independently.
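
The AdaIN operation itself is small: normalize each channel of the feature map to zero mean and unit std, then rescale and shift by the style's per-channel parameters. A numpy sketch (shapes and the `adain` name are my own, not StyleGAN's actual code):

```python
import numpy as np

def adain(x, y_s, y_b, eps=1e-5):
    """AdaIN on a (C, H, W) feature map: normalize each channel to zero
    mean / unit std, then rescale by style scale y_s and shift by y_b."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sd = x.std(axis=(1, 2), keepdims=True)
    return y_s[:, None, None] * (x - mu) / (sd + eps) + y_b[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(8, 16, 16))   # feature map with arbitrary statistics
y_s = np.full(8, 2.0)                         # per-channel style scale
y_b = np.full(8, -1.0)                        # per-channel style shift

out = adain(x, y_s, y_b)
print(out.mean(axis=(1, 2))[:2])   # ≈ y_b = -1 per channel
print(out.std(axis=(1, 2))[:2])    # ≈ y_s = 2 per channel
```

After the call, the layer's statistics are entirely dictated by the style vector, which is what lets each layer's style control a different scale of attributes.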

Latent-space interpolation

Linearly interpolate $z = (1 - \alpha)\,z_1 + \alpha\,z_2$ and decode $G(z)$. The output sweeps smoothly between the two endpoints. StyleGAN3 cat morphs are the canonical demo.
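
The interpolation sweep in code. The generator here is a hypothetical frozen stand-in (a random linear map plus tanh) for a real trained $G$; only the lerp logic is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

def lerp(z1, z2, alpha):
    """Linear interpolation between two latent vectors."""
    return (1.0 - alpha) * z1 + alpha * z2

# Hypothetical stand-in for a trained generator (e.g. StyleGAN).
W = rng.standard_normal((4, 512))
G = lambda z: np.tanh(W @ z)   # latent -> tiny "image" feature vector

z1, z2 = rng.standard_normal(512), rng.standard_normal(512)
frames = [G(lerp(z1, z2, a)) for a in np.linspace(0.0, 1.0, 9)]

# Endpoints decode the original latents; intermediate frames vary smoothly.
print(np.allclose(frames[0], G(z1)), np.allclose(frames[-1], G(z2)))
```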

Summary: pros & cons

  • Pros: simple objective, very high sample quality, fast single-step generation
  • Cons: no loss curve to monitor, training is unstable (mode collapse, vanishing gradients), hard to scale, no explicit $p(x)$ for likelihood evaluation
  • Era: GANs were the go-to image generator from ~2016-2021, then displaced by diffusion

From CS231n 2025 Lec 14 slides ~14-35 (minimax setup, alternating updates, saturation problem and non-saturating fix, optimal discriminator derivation, DC-GAN, StyleGAN AdaIN, latent interpolation, GAN summary pros/cons).