Contrastive Learning
The goal of contrastive representation learning is to learn an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart.
Resources
- https://lilianweng.github.io/posts/2021-05-31-contrastive/
- https://www.youtube.com/watch?v=7l6fttRJzeU&t=318s&ab_channel=ArtificialIntelligence
- First seen here: https://github.com/wjf5203/VNext
I first found out about this when I spoke with the people at Cohere AI at HackWestern.
We have the following categorization
- Inter-sample classification (most dominant)
- Given both similar ("positive") and dissimilar ("negative") candidates, identifying which ones are similar to the anchor data point is a classification task
- Feature Clustering
- Find similar data samples by clustering them with learned features
- Multiview coding
- Apply the InfoNCE objective to two or more different views of input data
The CLIP model enables Zero-Shot classification.
There are creative ways to construct a set of data point candidates:
- The original input and its distorted version
- Data that captures the same target from different views
Some losses:
- Contrastive Loss (works with a labelled dataset)
- Triplet Loss
- N-Pair Loss (generalizes triplet loss)
- Lifted Structured Loss
- Noise Contrastive Estimation (NCE)
- InfoNCE - see contrastive loss, used by CLIP
- Soft-Nearest Neighbors Loss
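Of these, the triplet loss is simple enough to sketch directly. A minimal NumPy version using squared Euclidean distance (the margin value and toy points are illustrative, not from any paper):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: push the positive at least `margin` closer than the negative.

    max(0, d(a, p) - d(a, n) + margin) with squared Euclidean distance d.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])     # close to the anchor
n = np.array([3.0, 0.0])     # far from the anchor
loss = triplet_loss(a, p, n)  # zero: margin already satisfied
```

When the negative is already more than `margin` farther than the positive, the loss is zero and the triplet contributes no gradient, which is why hard-negative mining matters in practice.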
Some contrastive models
Framework (CS231n 2025 Lec 12)
Given an anchor $x$, a positive $x^+$ (semantically similar, typically another augmentation of $x$), and $N$ negatives $x^-_1, \dots, x^-_N$ (other samples), learn an encoder $f$ and a score function $s$ such that:
$$s(f(x), f(x^+)) \gg s(f(x), f(x^-_j)) \quad \text{for all } j$$
InfoNCE loss (van den Oord 2018) frames this as an $(N+1)$-way softmax classification: pick the positive out of 1 positive + $N$ negatives:
$$\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\left[\log \frac{\exp s(f(x), f(x^+))}{\exp s(f(x), f(x^+)) + \sum_{j=1}^{N} \exp s(f(x), f(x^-_j))}\right]$$
Minimizing $\mathcal{L}_{\text{InfoNCE}}$ maximizes a lower bound on the mutual information between anchor and positive:
$$I(f(x); f(x^+)) \geq \log(N+1) - \mathcal{L}_{\text{InfoNCE}}$$
So more negatives → a tighter MI bound → a better representation (the intuition behind why SimCLR/MoCo push hard for large $N$).
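As a concrete illustration, a minimal NumPy version of InfoNCE for a single anchor, using temperature-scaled cosine similarity as the score $s$ (the temperature, dimensions, and sample counts are illustrative):

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE for one anchor: an (N+1)-way softmax over 1 positive and N negatives.

    Score s = cosine similarity / temperature. Returns -log p(positive | anchor).
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, neg) for neg in negatives]) / temperature
    logits -= logits.max()                        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                          # positive sits at index 0

rng = np.random.default_rng(0)
anchor = rng.normal(size=32)
positive = anchor + 0.01 * rng.normal(size=32)    # near-duplicate of the anchor
negatives = rng.normal(size=(16, 32))             # unrelated samples
loss = info_nce(anchor, positive, negatives)      # small: positive is easy to pick out
```

Swapping in a random vector as the "positive" drives the loss toward $\log(N+1)$, the chance-level value of the $(N+1)$-way classification.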
Instance vs. sequence contrastive learning
| Level | Positives from | Examples |
|---|---|---|
| Instance | two augmentations of the same image | [[research-papers/A Simple Framework for Contrastive Learning of Visual Representations\|SimCLR]] |
| Sequence | future timesteps of the same sequence given past context | Contrastive Predictive Coding (CPC) |
SimCLR (Chen 2020)
- Given a minibatch of $N$ images, draw two augmentation functions → produce $2N$ augmented views.
- Encode each with $f$ → $h$, then project with $g$ → $z$. At inference throw away $g$, keep only $h$.
- Score = cosine similarity → builds a $2N \times 2N$ affinity matrix; the positive pair for view $i$ is its partner at offset $N$.
- InfoNCE over the rows of the affinity matrix.
- Non-linear projection head is crucial: the representation space stays richer when the invariance pressure is applied in a separate space.
- Large batch is crucial: batch 8192 beats batch 256 by ~5 points on ImageNet linear eval. Requires TPU pods for memory.
- Top-1 ImageNet linear eval: SimCLR (4×) 76.5%, matching supervised ResNet-50.
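The steps above can be sketched as a minimal NT-Xent (SimCLR's InfoNCE variant) in NumPy; batch size, dimension, and temperature here are illustrative, and random vectors stand in for encoder outputs:

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """NT-Xent: view i's positive is its partner at offset N.

    z: (2N, d) projection outputs; rows i and i+N come from the same image.
    Returns the mean InfoNCE loss over all 2N anchors.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # cosine sim via dot products
    two_n = z.shape[0]
    sim = z @ z.T / temperature                         # (2N, 2N) affinity matrix
    np.fill_diagonal(sim, -np.inf)                      # a view never matches itself
    targets = (np.arange(two_n) + two_n // 2) % two_n   # partner index per row
    logits = sim - sim.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(two_n), targets].mean()

rng = np.random.default_rng(0)
n, d = 4, 32
h = rng.normal(size=(n, d))                             # stand-ins for encoder outputs
z = np.vstack([h + 0.01 * rng.normal(size=(n, d)),      # "view 1" of each image
               h + 0.01 * rng.normal(size=(n, d))])     # "view 2" of each image
loss = nt_xent(z)
```

Masking the diagonal with $-\infty$ implements "never match a view to itself"; each row is then a $(2N-1)$-way softmax whose target is the partner view.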
CPC (van den Oord 2018): sequence-level
- Encode each timestep: $z_t = g_{\text{enc}}(x_t)$.
- Summarize context: $c_t = g_{\text{ar}}(z_{\leq t})$ using an autoregressive model (the original paper uses a GRU-RNN).
- InfoNCE between context $c_t$ and future code $z_{t+k}$ with a time-dependent score $f_k(x_{t+k}, c_t) = \exp(z_{t+k}^\top W_k c_t)$, where $W_k$ is a trainable matrix, one per future offset $k$. Negatives are codes from other sequences / random positions.
Applied to audio (LibriSpeech: 64.6% phone, 97.4% speaker linear probe vs supervised 74.6% / 98.5%), and to images by raster-scanning 64×64 patches with 50% overlap (top-down context predicts bottom rows).
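A toy sketch of the bilinear score. Real CPC computes $z_t = g_{\text{enc}}(x_t)$ and $c_t = g_{\text{ar}}(z_{\leq t})$; here random stand-ins replace both networks, so only the scoring structure (one trainable $W_k$ per offset, softmax over candidate codes) is faithful:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_c, k_max = 16, 32, 3

c_t = rng.normal(size=d_c)                    # context summary at time t (stand-in)
W = rng.normal(size=(k_max, d_z, d_c)) * 0.1  # one trainable W_k per future offset k

def cpc_scores(z_candidates, c, W_k):
    """Bilinear CPC score f_k = exp(z^T W_k c) for each candidate future code z."""
    return np.exp(z_candidates @ (W_k @ c))

pred = W[0] @ c_t                             # direction the model scores against (k = 1)
z_pos = pred / np.linalg.norm(pred)           # toy positive, aligned with the prediction
z_negs = rng.normal(size=(8, d_z))            # codes from "other sequences"
scores = cpc_scores(np.vstack([z_pos[None], z_negs]), c_t, W[0])
p_pos = scores[0] / scores.sum()              # InfoNCE picks the positive via this softmax
```

The bilinear form means the model predicts a direction $W_k c_t$ in code space; candidates are ranked by how well they align with it.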
MoCo (He 2020)
See MoCo paper note. Key idea: a FIFO queue of keys from a momentum encoder decouples the number of negatives from the batch size. Gradient flows only through the query encoder; the key encoder is detached and updated by EMA.
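A minimal sketch of the two mechanisms, with plain weight vectors standing in for full encoders (queue size is illustrative; m = 0.999 matches the paper):

```python
import numpy as np

class MoCoQueue:
    """FIFO dictionary of keys plus EMA (momentum) update of the key encoder."""

    def __init__(self, dim, size, m=0.999):
        self.m = m
        self.queue = np.zeros((size, dim))       # FIFO queue of negative keys
        self.ptr = 0
        self.size = size

    def momentum_update(self, q_weights, k_weights):
        # key encoder = EMA of the query encoder; no gradient flows through it
        return self.m * k_weights + (1 - self.m) * q_weights

    def enqueue(self, keys):
        # newest minibatch of keys overwrites the oldest entries
        n = keys.shape[0]
        idx = (self.ptr + np.arange(n)) % self.size
        self.queue[idx] = keys
        self.ptr = (self.ptr + n) % self.size

moco = MoCoQueue(dim=4, size=8)
moco.enqueue(np.ones((3, 4)))                    # one minibatch of 3 keys
w_k = moco.momentum_update(q_weights=np.full(4, 2.0), k_weights=np.zeros(4))
```

Because the queue outlives any single batch, each query sees thousands of negatives while the batch itself stays small; the slow EMA keeps queued keys consistent with the current key encoder.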
DINO (Caron 2021): self-distillation without negatives
See DINO paper note. No explicit negatives: student matches a teacher (EMA of student) via cross-entropy on softmax outputs. Collapse prevented by centering (subtract running mean from teacher logits) + sharpening (low teacher temperature). Emergent property: attention maps of a ViT trained with DINO produce unsupervised object segmentation.
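Centering + sharpening can be sketched as follows (the temperatures are typical values; a batch mean stands in for the running-mean center, and random logits stand in for network outputs):

```python
import numpy as np

def softmax(x, temp):
    x = x / temp
    x = x - x.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def dino_teacher_targets(teacher_logits, center, t_temp=0.04):
    """Teacher side of DINO: centering then sharpening.

    Centering subtracts a running mean of teacher logits (guards against one
    dimension dominating); a low teacher temperature sharpens the distribution
    (guards against collapse to uniform).
    """
    return softmax(teacher_logits - center, t_temp)

def dino_loss(student_logits, teacher_targets, s_temp=0.1):
    """Cross-entropy between sharpened teacher targets and the student softmax."""
    log_p_student = np.log(softmax(student_logits, s_temp))
    return -(teacher_targets * log_p_student).sum(axis=-1).mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 10))                # stand-in for ViT head outputs
center = logits.mean(axis=0)                     # stand-in for the running mean
targets = dino_teacher_targets(logits, center)
loss = dino_loss(logits, targets)
```

The two tricks oppose each other by design: centering alone would push the teacher toward uniform, sharpening alone toward one-hot; together they avoid both collapse modes.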
Source
CS231n 2025 Lec 12 slides ~67β110 (contrastive framework, InfoNCE + MI bound, SimCLR full algorithm + batch-size ablation, MoCo queue + momentum update, MoCo-v2 hybrid, CPC sequence formulation + audio/image results, DINO self-distillation + centering/sharpening). 2026 PDF not published β using 2025 fallback.
Related
- Contrastive learning enables Transfer Learning
- Self-Supervised Learning
- InfoNCE
- SimCLR
- MoCo
- DINO