CS231n – Deep Learning for Computer Vision
Stanford's flagship deep-learning-for-vision course, originally designed by Andrej Karpathy. Light on theory, heavy on intuition built around backprop through computational graphs.
- Course page: https://cs231n.stanford.edu/
- Schedule (2026): https://cs231n.stanford.edu/schedule.html
- Online notes: https://cs231n.github.io/
Lectures
1. Foundations (Lec 1–4)
- Lec 1: Course intro, history of CV – biological vision, Hubel & Wiesel, ImageNet
- Lec 2: Image Classification – data-driven approach, kNN, hyperparameters & validation, Linear Classifier, Softmax vs SVM / Hinge Loss (loss sketch after this section)
- Lec 3: Regularization (L1/L2/elastic net) & Optimization – SGD problems (conditioning, saddle points, noise), SGD+Momentum, RMSProp, Adam, AdamW, LR schedules
- Lec 4: Neural Networks & Backpropagation – feature transforms / why nonlinearity, computational graphs, gradient-flow patterns (add/mul/copy/max gates), modular forward/backward API, vector & matrix backprop with implicit Jacobians, SiLU activations
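A minimal NumPy sketch of the two Lec 2 losses on a single example, assuming raw class scores `s` (e.g. `s = W @ x`) and correct label `y`; function names are mine, not the assignments'.

```python
import numpy as np

def svm_hinge_loss(s, y, margin=1.0):
    # Multiclass SVM: penalize every wrong class within `margin` of the true score.
    margins = np.maximum(0.0, s - s[y] + margin)
    margins[y] = 0.0          # the true class contributes no loss
    return margins.sum()

def softmax_loss(s, y):
    # Softmax cross-entropy; subtract the max first for numerical stability.
    s = s - s.max()
    log_probs = s - np.log(np.exp(s).sum())
    return -log_probs[y]

s = np.array([3.2, 5.1, -1.7])   # scores for 3 classes
print(svm_hinge_loss(s, y=0))    # hinge: only margin violations matter
print(softmax_loss(s, y=0))      # softmax: always wants more mass on y
```

The contrast the lecture draws falls out directly: hinge goes to exactly 0 once all margins are satisfied, while softmax keeps pushing probability toward the true class.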
2. CNNs and CNN Architectures (Lec 5–6)
- Lec 5: Convolutional Neural Networks – why preserve spatial structure, conv layer shapes (output $W' = (W - K + 2P)/S + 1$), padding/stride formulas, receptive fields (grow by $K-1$ per layer: $1 + L(K-1)$ after $L$ stride-1 convs), 1×1 conv, pooling, translation equivariance vs invariance, what filters learn
- Lec 6: Training CNNs & CNN Architectures – LayerNorm / BatchNorm / Dropout, SiLU, ImageNet history (AlexNet → VGG → GoogLeNet → ResNet), VGG's all-3×3 design, ResNet residual block ($y = F(x) + x$; PyTorch sketch after this section), Kaiming init, preprocessing & augmentation, transfer learning (similarity × data-size matrix), hyperparam workflow + loss curve diagnostics
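A minimal PyTorch sketch of the Lec 6 residual block $y = F(x) + x$ for the same-shape case (stride 1, equal channels); real ResNets add a strided projection shortcut when shapes change.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x with F = conv-BN-ReLU-conv-BN (basic block)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # identity shortcut: gradient flows straight through

x = torch.randn(2, 64, 32, 32)
print(ResidualBlock(64)(x).shape)    # (W - 3 + 2*1)/1 + 1 = W, so H and W are preserved
```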
3. Sequence Models (Lec 7–8)
- Lec 7: RNNs, LSTM, GRU – sequence patterns (1-to-many / many-to-1 / many-to-many / seq2seq), vanilla RNN $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$ (step sketch after this section), char-level LM + min-char-rnn, BPTT + truncated BPTT, image captioning (CNN→RNN with injection), multilayer RNNs, vanilla RNN gradient flow (repeated products of $W_{hh}$ → vanish/explode), gradient clipping, LSTM (i/f/o/g gates, uninterrupted gradient path through the cell state, cf. ResNet), GRU, Seq2Seq encoder-decoder
- Lec 8: Attention & Transformers – Bahdanau attention from the seq2seq bottleneck, generalized attention layer (scaled dot product, separate K/V; sketch after this section), self-attention as a permutation-equivariant set operator, masked SA, multi-head SA as 4 matmuls ($O(N^2)$ attention-matrix memory, Flash Attention fix), Transformer block (SA + MLP, residuals, LN, 6 matmuls), scale chart 213M → 175B, LLM recipe (embedding → masked Transformer → projection → softmax), modern tweaks (Pre-Norm, RMSNorm, SwiGLU, MoE), Vision Transformer (16×16 patches → linear projection ≡ strided conv → 2D positional encoding → unmasked self-attention → average pool + classifier)
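A minimal NumPy sketch of the Lec 7 vanilla RNN recurrence $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b)$; weight names follow the min-char-rnn convention, the rest is illustrative.

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, b):
    # Vanilla RNN cell: the same weights are reused at every timestep.
    return np.tanh(Whh @ h_prev + Wxh @ x + b)

D, H = 10, 4                          # input and hidden sizes
rng = np.random.default_rng(0)
Wxh = rng.normal(size=(H, D)) * 0.01
Whh = rng.normal(size=(H, H)) * 0.01
b, h = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):     # unroll over a length-5 sequence
    h = rnn_step(x, h, Wxh, Whh, b)
```

Backprop through this loop multiplies the upstream gradient by $W_{hh}^\top$ (times a tanh Jacobian) once per step, which is exactly the vanish/explode product the lecture highlights and the motivation for gradient clipping and the LSTM cell path.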
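And a NumPy sketch of the Lec 8 scaled dot-product self-attention for a single head, with an optional causal mask; shapes and names are illustrative.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv, mask=None):
    """Single-head scaled dot-product self-attention over N tokens (rows of X)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (N, N): the O(N^2) memory culprit
    if mask is not None:
        scores = np.where(mask, scores, -1e9)    # masked SA: block attention to the future
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A = A / A.sum(-1, keepdims=True)             # row-wise softmax
    return A @ V                                 # permutation-equivariant in the tokens

N, D = 6, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv, mask=np.tril(np.ones((N, N), dtype=bool)))
```

Multi-head attention runs several copies of this with smaller head dims and concatenates; the explicit (N, N) `scores` matrix is what Flash Attention avoids materializing.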
4. Detection, Segmentation, Visualization (Lec 9)
- Lec 9: Semantic Segmentation (sliding window → fully convolutional, encoder-decoder with unpooling / transposed conv, U-Net copy-and-crop), Object Detection (single-object multitask loss → multiple-object problem → Selective Search + R-CNN → Fast R-CNN with RoI Pool/Align → Faster R-CNN with RPN/anchors → single-stage YOLO/SSD/RetinaNet → DETR set prediction), instance segmentation (Mask R-CNN = Faster R-CNN + per-class 28×28 mask head, also pose), Feature Visualization (first-layer filters, saliency via backprop – sketch after this section, guided backprop, CAM, Grad-CAM, ViT attention)
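A minimal PyTorch sketch of the Lec 9 saliency map: the gradient of the top class score w.r.t. the input pixels, assuming torchvision is available; the random tensor stands in for a real preprocessed image, and `weights=None` keeps the sketch offline.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()     # load pretrained weights for real use
img = torch.randn(1, 3, 224, 224, requires_grad=True)

score = model(img)[0].max()                      # score of the highest-scoring class
score.backward()                                 # backprop the score to the pixels
saliency = img.grad.abs().max(dim=1).values      # (1, 224, 224) per-pixel importance
```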
5. Video, Distributed Training, Self-Supervised (Lec 10–12)
- Lec 10: Video understanding – video as a 4D tensor ($C \times T \times H \times W$; train on short clips, test-time clip ensemble), 3D CNN ("Slow Fusion") with shift-invariance comparison, C3D ("VGG of 3D CNNs", 39.5 GFLOP), Sports-1M, recognizing actions from motion (Johansson 1973), Optical Flow + Two-Stream Networks (temporal stream beats spatial on UCF-101), CNN+LSTM combos and Recurrent Convolutional Network (Ballas – replace matmul with 2D conv), Nonlocal Block (Wang CVPR 2018, drop-in to 3D CNNs via 1×1×1 Q/K/V convs), I3D – Inflating 2D Networks to 3D (Carreira & Zisserman, copy-along-time-and-divide-by-$T$ init trick), Video Transformers (factorized attention ViViT/TimeSformer, pooling MViTv2, masked autoencoders VideoMAE/V2), Kinetics-400 climb I3D 71.1 → SlowFast+NL 79.8 → MViTv2-L 86.1 → VideoMAE V2-g 90, visualizing video models (Appearance vs Slow vs Fast motion), Temporal Action Localization (Faster-R-CNN-style temporal proposals), Spatio-Temporal Detection (AVA dataset), audio-visual video understanding (McGurk effect, visually-guided audio source separation), efficient video (MoViNets, X3D, SCSampler, AdaMML, Listen to Look), egocentric (Project Aria), Video + LLMs (Video-LLaVA, Video-ChatGPT, VideoLLaMA 3)
- Lec 11: Large-scale distributed training – H100 hardware (HBM 80GB / 3352 GB/s, 132 SMs, Tensor Cores 4096 FLOP/cycle), K40→B200 FLOPs timeline, Data Parallelism → FSDP (ZeRO sharding of params/grads/Adam states, all_gather + reduce_scatter per layer) → HSDP (intra-node FSDP + inter-node DP), Llama3-405B memory arithmetic (800GB → 10GB/GPU; worked numbers after this section), activation checkpointing ($O(\sqrt{N})$-optimum compute/memory tradeoff), HFU vs MFU (>30% good, >40% excellent; GPT-3 21%, PaLM 46%), ND parallelism over (Batch, Seq, Dim) = DP/CP/PP/TP, Context Parallelism (Ring Attention, DeepSpeed Ulysses), Pipeline Parallelism (GPipe bubble + microbatches, active fraction $M/(M+P-1)$ for $M$ microbatches over $P$ stages), Tensor Parallelism (column-then-row sharding lets a two-layer MLP skip mid-layer comm), Llama3-405B 4D recipe table (8K→16K GPUs, seq 8192→131072, MFU 43%→41%→38%)
- Lec 12: Self-Supervised Learning – pretext tasks (rotation prediction / Gidaris 2018, relative patch location / Doersch 2015, jigsaw / Noroozi 2016, inpainting via Context Encoders / Pathak 2016, colorization / split-brain autoencoder Zhang 2017, video coloring Vondrick 2018), Masked Autoencoders (MAE) (75% masking, asymmetric encoder on visible-only patches + lightweight decoder, MSE on masked patches, ViT-H 448 → 87.8% ImageNet), contrastive learning framework (attract positives $x, x^+$, repel negatives $x^-$; InfoNCE as an $(N+1)$-way softmax, MI lower bound $I(x; x^+) \ge \log N - \mathcal{L}_{\text{InfoNCE}}$; sketch after this section), SimCLR ($2N$-view batch, cosine-similarity affinity matrix, non-linear projection head thrown away post-train, large batch crucial – 8192 on TPU), MoCo (FIFO queue of negatives + momentum encoder $\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q$, decouples batch size from #negatives, MoCo-v2 = MoCo queue + SimCLR MLP head + strong aug, 67.5% vs SimCLR 66.6% at 256-batch), instance-level (SimCLR/MoCo) vs sequence-level contrastive, CPC (van den Oord 2018 – encode $x_t \to z_t$, summarize context $c_t$ via GRU-RNN, InfoNCE with time-dependent score $z_{t+k}^\top W_k c_t$, applies to audio / image patches), DINO (self-distillation, student + teacher from EMA, cross-entropy with teacher centering + sharpening to prevent collapse, ViT 8×8 patches → emergent unsupervised object segmentation)
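A minimal PyTorch sketch of the Lec 12 InfoNCE loss in its in-batch form: for each embedding of view 1, the matching row of view 2 is the positive and the rest of the batch are negatives. This is a sketch of the idea, not the exact SimCLR implementation (which symmetrizes over all 2N views).

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """InfoNCE as an N-way softmax over cosine similarities (positives on the diagonal)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau              # (N, N) affinity matrix
    labels = torch.arange(z1.shape[0])    # row i's positive is column i
    return F.cross_entropy(logits, labels)

z1, z2 = torch.randn(256, 128), torch.randn(256, 128)   # two augmented views, embedded
print(info_nce(z1, z2))
```

The MI bound explains the large-batch obsession: more in-batch negatives means a harder softmax and a tighter lower bound.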
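And a back-of-envelope Python sketch of the Lec 11 FSDP/ZeRO memory arithmetic, assuming the standard mixed-precision accounting of 16 bytes per parameter (bf16 weights + bf16 grads + fp32 master copy + two fp32 Adam moments); the exact per-GPU figure depends on what the recipe shards.

```python
params = 405e9                           # Llama3-405B
bytes_per_param = 2 + 2 + 4 + 4 + 4      # weights + grads + fp32 master + Adam m + Adam v
total_gb = params * bytes_per_param / 1e9
print(f"unsharded model states: {total_gb:,.0f} GB")     # far beyond one 80 GB H100
for n_gpus in (8, 512, 16384):
    print(f"fully sharded over {n_gpus:>5} GPUs: {total_gb / n_gpus:,.1f} GB/GPU")
```

Activations come on top of this, which is where activation checkpointing's $O(\sqrt{N})$ tradeoff enters.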
6. Generative Models (Lec 13–14)
- Lec 13: Generative Models (part 1) – discriminative $p(y|x)$ vs generative $p(x)$ vs conditional generative $p(x|y)$, density normalization (all values of $x$ compete for probability mass), Goodfellow 2017 taxonomy (Explicit: Tractable=Autoregressive / Approximate=VAE; Implicit: Direct=GAN / Indirect=Diffusion), Autoregressive MLE via chain rule $p(x) = \prod_t p(x_t \mid x_{<t})$ → LLMs are autoregressive, PixelCNN (scanline-order 8-bit subpixels as 256-way softmax classification, exact $p(x)$, but 1024×1024 RGB = 3M sequential steps), (non-variational) autoencoder L2 reconstruction for feature learning and its generative failure (generating a new $z$ is no easier than generating a new $x$), VAE fix – force $z$ to come from a known prior $p(z) = \mathcal{N}(0, I)$, encoder $q_\phi(z|x)$ + decoder $p_\theta(x|z)$ where the Gaussian-decoder log-likelihood reduces to L2, ELBO derivation (Bayes rule → multiply top/bottom by $q_\phi(z|x)$ → three log terms → wrap in expectation → two KLs → drop the intractable posterior-KL $D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x)) \ge 0$), training objective with [[notes/Reparametrization Trick|reparam $z = \mu + \sigma \odot \epsilon$]] (loss sketch after this section), the two losses fight (reconstruction wants $\sigma \to 0$ + a unique $z$ per $x$; prior wants $q_\phi(z|x) \approx \mathcal{N}(0, I)$), sampling: $z \sim \mathcal{N}(0, I)$ → decoder, disentangling via diagonal prior (Kingma & Welling MNIST grid – walking $z_1/z_2$ smoothly traces digit identity and style)
- Lec 14: Generative Models (part 2) – GAN minimax $\min_G \max_D \, \mathbb{E}_x[\log D(x)] + \mathbb{E}_z[\log(1 - D(G(z)))]$, alternating SGD with no single loss curve, saturation problem at the start ($D(G(z)) \approx 0$ → flat gradient) and the non-saturating fix (train $G$ to maximize $\log D(G(z))$), optimal $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}$ → outer min achieved at $p_G = p_{data}$, DC-GAN (Radford ICLR 2016, who later did GPT-1/2), StyleGAN AdaIN with mapping + synthesis networks, latent-interpolation morphs, GAN era 2016–2021; Diffusion introduced via modern Rectified Flow ($x_t$ linearly interpolates data → noise, train $v_\theta(x_t, t)$ to regress the constant velocity – entire training loop is 5 lines (sketch after this section), sampling is Euler-stepping backward), Classifier-Free Guidance (Ho & Salimans 2022 – randomly drop the conditioning $c$ during training so the same net is conditional + unconditional, at sampling combine $v_u + s\,(v_c - v_u)$ to extrapolate toward the conditional direction, doubles sampling cost), noise-schedule trick (middle $t$ is hardest because of ambiguity → sample $t$ logit-normal), LDMs (compress to latents via VAE + GAN encoder/decoder – small KL weight + discriminator fixes the blurry VAE decoder, then DiT denoises latents), DiT conditioning via predicted scale/shift (adaLN-Zero) or cross-attention, T2I (FLUX.1 – T5+CLIP, 12B DiT, 1024 tokens), T2V (Meta MovieGen – 30B DiT, 76K tokens), 2024 video-diffusion explosion (Sora/Gen3/MovieGen/Veo 2/Wan/Cosmos/Kling), distillation collapses 30–50 steps to 1, generalized diffusion template unifies Rectified Flow / Variance-Preserving / Variance-Exploding and x/ε/v-prediction targets, score-function view + reverse-SDE view + AR-strikes-back via discrete latents (VQ-VAE + AR Transformer)
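A minimal PyTorch sketch of the Lec 13 VAE objective: L2 reconstruction (the Gaussian-decoder log-likelihood) plus the closed-form KL to the unit-Gaussian prior, with the reparam trick keeping sampling differentiable; the encoder/decoder networks are omitted.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    # z = mu + sigma * eps: moves the randomness into eps so gradients reach the encoder.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_loss(x, x_hat, mu, log_var):
    recon = F.mse_loss(x_hat, x, reduction="sum")                   # pulls sigma -> 0
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())  # pulls q -> N(0, I)
    return recon + kl
```

The two terms are literally the two fighting losses from the lecture.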
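And a sketch of the Lec 14 "five-line" rectified-flow loop, under the convention that $t=0$ is data and $t=1$ is noise; `model(x_t, t)` is a placeholder velocity network and `x0` a batch of data.

```python
import torch

def rf_train_step(model, x0, opt):
    """Regress the constant velocity (noise - data) along the straight line between them."""
    eps = torch.randn_like(x0)                 # x1: pure noise
    t = torch.rand(x0.shape[0], 1)             # one random time per sample
    x_t = (1 - t) * x0 + t * eps               # linear interpolation
    loss = ((model(x_t, t) - (eps - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss

@torch.no_grad()
def rf_sample(model, shape, steps=50):
    """Euler-integrate the learned velocity backward from noise (t=1) to data (t=0)."""
    x, dt = torch.randn(shape), 1.0 / steps
    for i in reversed(range(steps)):
        t = torch.full((shape[0], 1), (i + 1) * dt)
        x = x - dt * model(x, t)               # step against the data->noise velocity
    return x
```

Classifier-free guidance slots into `rf_sample` by evaluating the model twice (with and without conditioning) and extrapolating, which is the doubled sampling cost mentioned above.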
7. 3D, Vision+Language, Robotics, HCAI (Lec 15–18)
- Lec 15: 3D Vision (Jiajun Wu) – four-quadrant taxonomy (explicit/implicit × parametric/non-parametric), explicit vs implicit surfaces (parametric torus vs implicit sphere $x^2 + y^2 + z^2 - 1 = 0$), level sets, CSG, SDF blending, datasets survey (ShapeNet 3M → ShapeNetCore 51K → Objaverse-XL 10M, CO3D, PartNet, ScanNet), task zoo (generative vs discriminative), pipelines by representation: Multi-View CNN (Su ICCV 2015 – max-pool over rendered views, ~90% ModelNet40), voxel nets (3D ShapeNets / 3D-GAN / Visual Object Networks – differentiable projection for shape+texture edits) with octree sparsification (OctNet, O-CNN, OGN), PointNet (permutation + sampling invariance via shared MLP + max pool; Chamfer + EMD for point-cloud losses – Chamfer sketch after this list; EdgeConv graph extension), AtlasNet (Groueix CVPR 2018 – MLPs parameterize surface patches), deep implicit functions (Occupancy Networks Mescheder CVPR 2019, DeepSDF Park CVPR 2019, LDIF Genova CVPR 2020 with local ellipsoid elements), NeRF Mildenhall ECCV 2020 (MLP $(x, d) \to (c, \sigma)$, volume rendering with transmittance weights – compositing sketch after this list), 3D Gaussian Splatting Kerbl SIGGRAPH 2023 (sparse explicit Gaussian blobs, 137 FPS vs NeRF 0.07, ~2000× faster at comparable quality), structure-aware reps (part sets → relationship graphs → hierarchies → StructureNet hierarchical graphs Mo 2019 → programs)
- Lec 16: Vision-Language Models / Multi-Modal Foundation Models (Ranjay Krishna) – foundation-model taxonomy (Language: ELMo/BERT/GPT/T5; Classification: CLIP/CoCa; LM+Vision: LLaVA/Flamingo/GPT-4V/Gemini/Molmo; And More!: SAM/Whisper/DALL-E/Stable Diffusion/Imagen; Chaining: CuPL/VisProg), CLIP symmetric InfoNCE on 400M pairs + zero-shot via text-encoder-as-classifier + prompt eng (+1.3% "A photo of a X", +5% multi-prompt mean) + OOD wins (Adversarial 2.7→77.1, Rendition 37.7→88.9 vs ResNet101 same 76.2 on ImageNet) at 307M vs 44.5M params, CLIP disadvantages (batch-size dependence for fine-grained "Welsh Corgi"@32K, compositionality fails on Winoground/CREPE/ARO/SugarCREPE, NegCLIP hard-positive collapse, image-level captions too coarse, CSAM in 5B datasets), CoCa adds Multimodal Text Decoder + captioning loss → 86.3% zero-shot / 91.0% finetuned, LLaVA (CLIP penultimate ViT layer not CLS → linear bridge → LLaMA, 3-stage recipe: init frozen + train linear + finetune both, >100K GPT-4-generated instruction tuples), Flamingo (frozen NFNet + frozen Chinchilla + Perceiver Resampler downsamples to fixed visual tokens + GATED XATTN-DENSE between LM blocks with tanh(α) gates init at 0 so the frozen LM is preserved at step 0, interleaved `<image>`/`<eos>` tokens with mask-to-most-recent → in-context few-shot), Molmo (fully open Sep 2024: weights+data+code+evals; PixMo 700K via spoken 60–90s annotations vs LLaMA 3.1V's 6B; outputs grounded `<point x= y= alt=>` tags; Elo 1076 = 2nd behind GPT-4o 1079; chains with SAM 2), SAM (heavy image encoder + light prompt encoder for points/box/mask/text + lightweight mask decoder, ambiguity → 3 valid masks + confidence with loss only on the best match, SA-1B = 1B masks/11M images = 6×/400× OpenImages built via data-engine flywheel, zero-shot on bacteria/Van Gogh/produce), CuPL (GPT-3 "What does a {class} look like?" → CLIP, +0.65 ImageNet, +3.7 DTD, collapses 80→3 prompts), VisProg (Gupta 2023 – GPT writes Python that calls Loc/FaceDet/Seg/Select/Classify/Vqa/Replace/ColorPop/BgBlur/Tag/Emoji/Crop/List/Eval modules from in-context examples)
- Lec 17: Robot Learning (Yunzhu Li, May 29 2025) – 7-section survey: problem formulation as agent ↔ physical world with (state, action, reward) + casting cart-pole/locomotion/Atari/Go/text-gen/chatbot/cloth-folding into the same loop; embodied / active / situated robot perception vs CV; RL structurally differs from SL via stochasticity / credit assignment / nondifferentiability / nonstationarity, DQN Atari pipeline (Conv 4→16 8×8/s4 → Conv 16→32 4×4/s2 → FC-256 → FC-A on 4×84×84), DeepMind game milestones (AlphaGo Jan 2016 → AlphaGo Zero Oct 2017 → AlphaZero Dec 2018 → MuZero Nov 2019 + AlphaStar Vinyals Science 2018 + OpenAI Five Apr 2019), real-robot RL (ETH RSL Sci Robotics 2020, Unitree B2-W Dec 2024, OpenAI Rubik's Cube 2019, Visual Dexterity Sci Robotics 2023), model-free is sample-inefficient ("3000 years in 40 days") motivating model-based; model learning + receding-horizon planning with key choice of form: pixel dynamics (Deep Visual Foresight Finn & Levine ICRA 2017, CDNA conv+LSTM), keypoint dynamics (Manuelli/Li/Florence/Tedrake CoRL 2020 KUKA pushing Cheez-It), particle dynamics (Wang RSS 2023, RoboCook CoRL 2023 Best Systems – granola/rice/dough piles via GNN); imitation learning flavors – BC (distribution shift / "no data on how to recover" car illustration), DAgger (iterative expert correction), IRL (RL: env+reward→behavior; IRL: env+behavior→reward), Implicit BC (energy-based $\arg\min_a E_\theta(o, a)$ for multi-modal actions), Diffusion Policy (denoise an action sequence over iterations + action chunking for receding-horizon control, beats LSTM-GMM/IBC/BET on multi-modality+commitment); robotic foundation models / LBMs = policy mapping (obs/state, goal) → action without explicit state/transition (RT-1 Dec 2022 → RT-2 Jul 2023 → RT-X Oct 2023 1M episodes/22 embodiments/527 skills/311 scenes/34 labs → OpenVLA Jun 2024 = Llama2-7B + DINOv2 + SigLIP + 7-DoF de-tokenizer → π-Zero Oct 2024 by Physical Intelligence with cross-embodiment dataset + zero-shot/specialized/efficient post-training tracks, open-sourced as openpi Feb 4 2025 → Helix Figure / Hi-Robot PI / Gemini Robotics / GR00T Nvidia / DYNA-1); remaining challenges – eval is costly+noisy with weak training-loss correlation (ALOHA 2 fleet) and sim-to-real gap (no "ImageNet of embodied AI"; BEHAVIOR / Habitat 3.0 candidates), foundation policy vs foundation world model (action-conditioned future prediction, DayDreamer / Nvidia Cosmos / 1X), VLM/LLM not tailored for embodiment ("RL from human feedback" → "RL from embodied feedback"; SAM/DINOv2 closer than GPT), adaptation/lifelong (BEHAVIOR-1K preference rank), system-level: Helix's two-system split (7B VLM at 7–9 Hz on GPU 2 + 80M Transformer at 200 Hz on GPU 1) vs Hi-Robot's high-level VLM emitting low-level language commands to a low-level VLA
- Lec 18: 3D Vision (slides credit Justin Johnson, presented by Fei-Fei Li / Ehsan Adeli / Chen Wang, Jun 4, 2024 – the 2024 deck used as a substitute since the 2025 Human-Centered AI deck was not posted) – Recall 2D detection/segmentation hierarchy + video as 4D tensor; Multi-View CNN (Su ICCV 2015 – render N views → shared CNN1 → element-wise max-pool over views → CNN2 → softmax, ~90% ModelNet40); 5-rep taxonomy (depth map / voxel grid / point cloud / mesh / implicit surface); 2.5D – depth maps (Eigen & Fergus ICCV 2015, scale-invariant loss to handle scale ambiguity) + surface normals (per-pixel cosine loss); voxels – 3D ShapeNets pipeline (6³/5³/4³ convs → FC → class), 3D-R2N2 (Choy ECCV 2016, 2D CNN + 3D CNN decoder + per-voxel CE), 1024³ float32 = 4 GB memory wall, OGN octree (Tatarchenko ICCV 2017) to go higher-resolution; pointclouds – Point Set Generation (Fan CVPR 2017, FC + conv heads with Chamfer loss), PointNet applications (classification / semantic seg / part seg), DenseFusion (Wang CVPR 2019, RGB CNN per-pixel feat + PointNet per-point feat → project & concat → 6D pose); triangle meshes – Pixel2Mesh (Wang ECCV 2018: ellipsoid template → iterative refinement 156→628→2466 verts, graph conv $f_i' = W_0 f_i + \sum_{j \in N(i)} W_1 f_j$, vertex-aligned features via bilinear-sampled CNN feats from conv3_3/4_3/5_3, Chamfer loss on mesh→pointcloud sampling), Mesh R-CNN (Gkioxari ICCV 2019, mesh head on Mask R-CNN); implicit (Ren Ng CS184/284A slides) – algebraic surfaces (zero set of a polynomial), CSG (boolean trees on primitives), level sets (grid + trilinear interp of the $f(x) = 0$ isosurface), DeepSDF (Park CVPR 2019); NeRF variants – Nerfies (Park ICCV 2021, deformable), RawNeRF (Mildenhall CVPR 2022, HDR), BlockNeRF (Tancik CVPR 2022, SF tiling), cost: 1–2 days V100 train + 14.6M MLP forwards per render at 224 samples/pixel; 3D Gaussian Splatting vs NeRF (continuous MLP-along-ray vs blend-discrete-Gaussians-along-ray, hours→minutes fitting + 10s/frame→real-time render), Dynamic 3D Gaussians (Luiten 3DV 2024) + Gaussian Splatting SLAM (Matsuki CVPR 2024); foundation models for 3D – DreamFusion (Poole arXiv 2022, Score Distillation Sampling: optimize a NeRF so renders match a 2D text-to-image diffusion model), CAT3D (Gao arXiv 2024, multi-view diffusion → fit 3D)
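A minimal NumPy sketch of the Chamfer distance used as the point-cloud loss in Lec 15/18: each point is matched to its nearest neighbor in the other set, in both directions. Some papers sum rather than average the terms.

```python
import numpy as np

def chamfer_distance(S1, S2):
    """Symmetric Chamfer distance between point sets S1 (N, 3) and S2 (M, 3)."""
    d2 = ((S1[:, None, :] - S2[None, :, :]) ** 2).sum(-1)   # (N, M) pairwise squared dists
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

S1, S2 = np.random.rand(1024, 3), np.random.rand(2048, 3)
print(chamfer_distance(S1, S2))
```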
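And a NumPy sketch of NeRF-style volume rendering along a single ray, compositing sampled colors by opacity and accumulated transmittance; variable names are mine.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """sigmas: (S,) densities; colors: (S, 3) RGB; deltas: (S,) spacing between samples."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                           # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))    # T_i: light surviving to sample i
    weights = trans * alphas                                          # each sample's contribution
    return (weights[:, None] * colors).sum(axis=0)                    # composited pixel RGB

S = 224                                  # the per-pixel sample count quoted for NeRF
rgb = render_ray(np.random.rand(S), np.random.rand(S, 3), np.full(S, 0.01))
```

3D Gaussian Splatting keeps the same alpha compositing but blends a sparse set of explicit Gaussians instead of querying an MLP at every sample, which is where the speedup comes from.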
Lessons
- Stage your forward and backward pass (I actually didn't do this super well). I think this will make for more readable code in the future; sketch below.
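A minimal sketch of that staging in the CS231n assignment style: paired `*_forward`/`*_backward` functions where forward returns a cache and backward only consumes it (the pairing convention is the assignments'; the code below is my own example).

```python
import numpy as np

def affine_forward(x, w, b):
    """Forward: compute the output and stash everything backward will need."""
    out = x @ w + b
    cache = (x, w)
    return out, cache

def affine_backward(dout, cache):
    """Backward: read the cache, return gradients; no recomputation, no shape guessing."""
    x, w = cache
    return dout @ w.T, x.T @ dout, dout.sum(axis=0)

x, w, b = np.random.randn(4, 3), np.random.randn(3, 2), np.zeros(2)
out, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(np.ones_like(out), cache)
```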

