Self-Supervised Learning (SSL)

Self-supervised learning is a type of machine learning where the model generates its own supervisory signal from the input data. It doesn’t rely on external labels.

Speaking with Hemal Shah

This is one of the most powerful ways to train models because we exploit the structure of the data itself to produce labels. We no longer need to rely on human annotation.

Learned from Lilian Weng: https://www.youtube.com/watch?v=7l6fttRJzeU&t=318s&ab_channel=ArtificialIntelligence

This is a subset of unsupervised learning where we minimize a loss that compares the model's output against targets derived from the input data itself.

Two methods:

  1. Self-Prediction: Given an individual data sample, the task is to predict one part of the sample given the other part. Ex:
    1. crop part of an image and train a model to predict the cropped pixels, or
    2. decolourize the original image, and train the model to predict the colourized version
  2. Contrastive Learning: Given multiple data samples, the task is to predict the relationship among them. Ex:
    1. learn embeddings such that related samples (e.g. two augmented views of the same image) are close and unrelated samples are far apart
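The decolourize example above can be sketched as pure label generation: the training pair comes from the image alone, no annotator involved. A minimal sketch (numpy only; the Rec. 601 grayscale weights are a standard choice, not from the source):

```python
import numpy as np

def decolorize_pair(image):
    """Build a self-prediction training pair from a single RGB image:
    the model input is the grayscale version, the regression target is
    the original colors. The data supervises itself."""
    # Luminance-weighted grayscale (standard Rec. 601 coefficients).
    gray = image @ np.array([0.299, 0.587, 0.114])
    return gray[..., None], image  # (input, target)

# Toy 4x4 RGB "image" with values in [0, 1].
rng = np.random.default_rng(0)
img = rng.random((4, 4, 3))
x, y = decolorize_pair(img)
print(x.shape, y.shape)  # (4, 4, 1) (4, 4, 3)
```

A real pipeline would feed `x` through an encoder-decoder and regress `y`; the point here is only that both tensors come from one unlabeled image.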

Pretext tasks (CS231n 2025 Lec 12)

A pretext task is a hand-designed surrogate objective whose labels are derivable from the data itself. Train a feature encoder to solve it, throw away the head, and use the backbone features for downstream tasks.

| Task | Paper | Signal |
| --- | --- | --- |
| Rotation prediction | Gidaris et al. 2018 | predict one of {0°, 90°, 180°, 270°} applied to input |
| Relative patch location | Doersch et al. 2015 | given center patch + neighbor, predict which of 8 positions the neighbor came from |
| Jigsaw puzzle | Noroozi & Favaro 2016 | shuffle 3×3 patches, predict the permutation |
| Inpainting (Context Encoders) | Pathak et al. 2016 | mask a large region, regress the pixels |
| Colorization | Zhang et al. 2016 | grayscale → color (ab channels of Lab) |
| Split-brain autoencoder | Zhang et al. 2017 | two half-networks predict each other's channels |
| Video coloring | Vondrick et al. 2018 | propagate color from reference frame to target frame using learned correspondences |
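The rotation-prediction row is the simplest of these to make concrete: the label is just the index of a random 90° rotation. A sketch in the spirit of Gidaris et al. 2018 (numpy only; the function name is mine, not from the paper):

```python
import numpy as np

def rotation_pretext(image, rng):
    """Rotation-prediction pretext task: rotate the image by a random
    multiple of 90 degrees. The rotation index IS the classification
    label, derived from the data itself."""
    k = int(rng.integers(0, 4))       # label in {0,1,2,3} -> {0,90,180,270} deg
    rotated = np.rot90(image, k=k, axes=(0, 1))
    return rotated, k

rng = np.random.default_rng(42)
img = np.arange(2 * 2 * 3).reshape(2, 2, 3)
x, label = rotation_pretext(img, rng)
# Sanity check: undoing the rotation recovers the original image.
assert np.array_equal(np.rot90(x, k=-label, axes=(0, 1)), img)
```

A classifier trained to predict `label` from `x` is forced to learn object orientation cues, which is where the useful features come from.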

The problem: each pretext task is a hand-tuned heuristic. The modern answer (contrastive / masked reconstruction) replaces the heuristic with an objective that scales.

Masked Autoencoders (MAE, He 2021)

Pretext = mask a large fraction (75%) of ViT patches, reconstruct pixels of masked patches.

Two design choices make MAE different from BERT-for-images:

  1. Asymmetric encoder-decoder. The encoder only sees the 25% visible patches (no mask tokens) — slashes compute ~4×. A lightweight decoder takes encoder output + learnable mask tokens at masked positions, reconstructs the full image.
  2. Loss on masked patches only. MSE on pixels, only where masked. Forces the encoder to learn content, not identity.
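Both design choices show up directly in the data flow: the encoder input is only the visible subset, and the loss indexes only the masked subset. A shape-level sketch (numpy; the encoder/decoder are placeholders just to show where they would go, not MAE's actual networks):

```python
import numpy as np

def mae_step(patches, mask_ratio=0.75, rng=None):
    """Sketch of MAE's data flow (He et al. 2021). `patches` is
    (num_patches, patch_dim). The encoder would see only `visible`;
    the MSE loss is computed on the masked patches ONLY."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = patches.shape[0]
    num_masked = int(n * mask_ratio)
    perm = rng.permutation(n)
    masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]

    visible = patches[visible_idx]           # encoder input: the 25% kept
    # ... encoder(visible), then decoder(latent + mask tokens) go here ...
    reconstruction = np.zeros_like(patches)  # stand-in decoder output

    # Design choice 2: loss over masked positions only.
    loss = np.mean((reconstruction[masked_idx] - patches[masked_idx]) ** 2)
    return visible, loss

patches = np.ones((16, 8))                   # 16 patches of dim 8
visible, loss = mae_step(patches)
print(visible.shape, loss)  # (4, 8) 1.0
```

Because the encoder never receives mask tokens, its sequence length is a quarter of the full image, which is where the ~4× compute saving comes from.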

ViT-H at 448 resolution → 87.8% ImageNet top-1 (best ImageNet-1K-only at publication time). Transfer to COCO detection beats supervised pretraining.

Useful for: any downstream vision task where you have images but not labels. Linear probing is weaker than full fine-tuning for MAE (unlike contrastive methods): its features are less linearly separable, but they fine-tune well.

See the full paper breakdown: Masked Autoencoders Are Scalable Vision Learners.

Contrastive framework

A general shift from pretext tasks — see Contrastive Learning for the full InfoNCE treatment and the SimCLR / MoCo / CPC / DINO family.

Source

CS231n 2025 Lec 12 slides ~1–66 (pretext task zoo, MAE, agenda for contrastive). 2026 PDF not published — using 2025 fallback.