A Simple Framework for Contrastive Learning of Visual Representations

How do they prevent mode collapse? In short: every other image in the batch serves as a negative in the InfoNCE loss, so an encoder that maps everything to the same point pays a large penalty.

Walkthrough (CS231n 2025 Lec 12)

The pipeline

  • $\mathcal{T}$ is the augmentation distribution (random crop + color distortion + Gaussian blur; the lecture notes these as the crucial three).
  • $f(\cdot)$ is the feature encoder (ResNet-50 in the paper). Its output $h = f(\tilde{x})$ is what you keep for downstream tasks.
  • $g(\cdot)$ is a small MLP projection head. Its output $z = g(h)$ is where the contrastive loss is applied. Throw $g$ away at inference (see the sketch after this list).
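
A minimal PyTorch sketch of this pipeline, assuming a torchvision ResNet-50 backbone and the paper's 2048 → 2048 → 128 projection head; the augmentation parameters and the class name SimCLRModel are illustrative, not taken from the lecture.

```python
import torch.nn as nn
import torchvision.transforms as T
from torchvision.models import resnet50

# The augmentation distribution: random crop + color distortion + Gaussian blur.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

class SimCLRModel(nn.Module):
    def __init__(self, proj_dim=128):
        super().__init__()
        backbone = resnet50(weights=None)
        feat_dim = backbone.fc.in_features          # 2048 for ResNet-50
        backbone.fc = nn.Identity()                 # f(.): encoder whose output h is kept downstream
        self.f = backbone
        self.g = nn.Sequential(                     # g(.): projection head, discarded at inference
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.f(x)   # representation used for downstream tasks
        z = self.g(h)   # projection fed to the contrastive loss
        return h, z
```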

Minibatch algorithm

For a minibatch of $N$ images:

  1. Draw two augmentations $t, t' \sim \mathcal{T}$ and apply both to every image → $2N$ augmented views.
  2. Encode with the shared encoder + projection → projections $z_1, \dots, z_{2N}$.
  3. Build the $2N \times 2N$ affinity matrix of pairwise cosine similarities $s_{i,k} = z_i^\top z_k / (\lVert z_i \rVert \lVert z_k \rVert)$.
  4. For each row $i$, the positive is at position $i+N$ or $i-N$ (the partner view of the same source image); all other entries are negatives.
  5. InfoNCE per row: $\ell_i = -\log \frac{\exp(s_{i,\,i\pm N}/\tau)}{\sum_{k \ne i} \exp(s_{i,k}/\tau)}$. The total loss averages $\ell_i$ over all $2N$ rows, i.e. over both views of every source image (sketched in code below).
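
A sketch of steps 3–5 as code, assuming the $2N$ projections are stacked so that rows $i$ and $i+N$ come from the same source image; the function name nt_xent_loss and the default temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    """z: (2N, d) projections; rows i and i+N are the two views of one image."""
    two_n = z.shape[0]
    z = F.normalize(z, dim=1)                        # unit norm, so dot products are cosine similarities
    sim = z @ z.t() / temperature                    # step 3: 2N x 2N affinity matrix
    mask = torch.eye(two_n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))       # a view is never its own negative
    pos = torch.arange(two_n, device=z.device).roll(two_n // 2)  # step 4: partner sits at i+N or i-N
    return F.cross_entropy(sim, pos)                 # step 5: InfoNCE per row, averaged over all 2N rows
```

Usage under the same assumptions: stack the two augmented batches as torch.cat([x1, x2]), run them through the model, and pass the resulting z to nt_xent_loss.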

Why the projection head helps

The projection $z = g(h)$ is trained to be invariant under augmentation, and that invariance collapses useful signal (e.g. color, orientation). $h$ sits one MLP away from the loss, so it keeps information that $z$ had to throw out. Linear eval on $h$ beats linear eval on $z$ consistently; the SimCLR ablation table shows the non-linear projection head is ~7 points better than no projection head.
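
A sketch of that comparison, assuming the SimCLRModel from the pipeline sketch above; the helper extract_features and the probe dimensions (2048-d for $h$, 128-d for $z$) are illustrative.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def extract_features(model, loader, use_h=True):
    """Collect frozen features (h or z) and labels for a linear probe."""
    model.eval()
    feats, labels = [], []
    for x, y in loader:
        h, z = model(x)
        feats.append(h if use_h else z)
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

# Train one linear classifier on h and another on z with the encoder frozen;
# in the paper's ablation the probe on h wins consistently.
probe_h = nn.Linear(2048, 1000)   # linear eval on the encoder output h
probe_z = nn.Linear(128, 1000)    # linear eval on the projection output z
```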

Why large batch matters

Larger batch = more negatives = tighter MI lower bound (InfoNCE bounds the mutual information between the two views by $\log(2N-1) - \mathcal{L}$, so more negatives raise the ceiling; spelled out below). The paper sweeps batch sizes from 256 to 8192, and every doubling improves ImageNet linear-eval Top-1. Batch 8192 requires TPU pods, which is the motivation for MoCo's decoupling trick.
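
Spelled out, assuming the standard InfoNCE bound (Oord et al., 2018) applied per row with $2N-1$ candidates; the numbers below just plug the sweep endpoints into that formula.

```latex
\[
  I\bigl(z_i ;\, z_{j(i)}\bigr) \;\ge\; \log(2N - 1) \;-\; \mathcal{L}_{\text{NT-Xent}}
\]
\[
  N = 256:\quad \log 511 \approx 6.2 \text{ nats}
  \qquad\qquad
  N = 8192:\quad \log 16383 \approx 9.7 \text{ nats}
\]
```

At a fixed loss value, the maximum mutual information the loss can certify therefore grows from roughly 6.2 to 9.7 nats as the batch goes from 256 to 8192.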

Downstream numbers (Chen 2020)

  • Linear eval on ImageNet, Top-1: SimCLR 69.3% (ResNet-50); SimCLR with ResNet-50 (4× width) reaches 76.5%, matching supervised ResNet-50.
  • Fine-tuning on 1% of the labels, Top-5: SimCLR (4×) reaches 85.8%, beating AlexNet trained with 100× more labels.

Source

CS231n 2025 Lec 12 slides ~76–86, 105–108 (SimCLR architecture + pipeline, minibatch algorithm, affinity matrix, projection head ablation, large-batch ablation, semi-supervised table, summary).