CS231n - Deep Learning for Computer Vision

Stanford’s flagship deep-learning-for-vision course, originally designed by Andrej Karpathy. It is light on theory and heavy on intuition, much of it built around backpropagation through computational graphs.

Resources

Lectures

Lec 1 - Course Intro and CV History

  • Course framing, biological vision, and the ImageNet-era story of computer vision.

Lec 2 - Image Classification

Lec 3 - Regularization and Optimization

  • Regularization, SGD issues, optimizer variants, and learning-rate schedules.
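As a rough illustration of the learning-rate schedules mentioned above, here is a minimal sketch of step decay and cosine decay. The function names, the default decay factor, and the decay interval are illustrative assumptions, not values taken from the lecture.

```python
import math

def step_decay(base_lr, epoch, drop=0.1, every=30):
    """Multiply the rate by `drop` once every `every` epochs (defaults are assumptions)."""
    return base_lr * (drop ** (epoch // every))

def cosine_decay(base_lr, step, total_steps):
    """Anneal from base_lr at step 0 down to 0 at total_steps along a half cosine."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    # Print a few points of each schedule to compare their shapes.
    for t in (0, 25, 50, 75, 100):
        print(t, step_decay(0.1, t), cosine_decay(0.1, t, 100))
```

Step decay drops the rate abruptly at fixed intervals, while cosine decay lowers it smoothly and needs only the total number of steps as a hyperparameter.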

Lec 4 - Neural Networks and Backpropagation

  • Nonlinearity, computational graphs, gradient flow, and practical backprop patterns.

Lec 5 - Convolutional Neural Networks

  • Spatial structure, convolution shapes, receptive fields, pooling, and what CNN filters learn.
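The shape arithmetic behind convolution layers and receptive fields can be sketched in a few lines. This assumes square inputs and filters; the helper names are hypothetical, and raising an error when the filter does not tile the input evenly is a choice made here (many frameworks round down instead).

```python
def conv_output_size(w, f, stride=1, pad=0):
    """Spatial output size for input width w and filter size f:
    (w - f + 2 * pad) / stride + 1."""
    span = w - f + 2 * pad
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return span // stride + 1

def receptive_field(num_layers, f):
    """Receptive field of a stack of stride-1 f x f conv layers: 1 + L * (f - 1)."""
    return 1 + num_layers * (f - 1)

if __name__ == "__main__":
    print(conv_output_size(32, 5, stride=1, pad=2))  # "same" padding keeps the width
    print(receptive_field(3, 3))                     # three stacked 3x3 layers
```

With a 5x5 filter, stride 1, and padding 2, a 32-wide input stays 32 wide, and three stacked 3x3 layers see the same 7x7 region as one 7x7 layer.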

Lec 6 - Training CNNs and CNN Architectures

  • Normalization, dropout, activation trends, ImageNet architectures, and transfer-learning workflow.

Lec 7 - RNNs, LSTMs, and GRUs

  • Sequence patterns, BPTT, captioning, gradient issues, and gated recurrent models.

Lec 8 - Attention and Transformers

  • Seq2seq attention, self-attention, Transformer blocks, scaling, and ViTs.
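To make the self-attention item concrete, here is a minimal sketch of single-head scaled dot-product self-attention using plain Python lists. The helper names and the list-of-rows matrix layout are assumptions made for illustration; a real implementation would use a tensor library and add masking, multiple heads, and an output projection.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def self_attention(x, wq, wk, wv):
    """x: (tokens, d_model); wq, wk, wv: (d_model, d_k) projection matrices.
    Returns the attended values and the attention weights."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # (tokens, tokens) similarities
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three tokens, d_model = 2
    eye = [[1.0, 0.0], [0.0, 1.0]]
    out, weights = self_attention(x, eye, eye, eye)
    print(weights)
    print(out)
```

Each row of the weight matrix sums to 1, so every output token is a convex combination of the value vectors.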

Lec 9 - Segmentation, Detection, and Visualization

Lec 10 - Video Understanding

  • Video tensors, 3D CNNs, optical flow, video transformers, and efficient / multimodal video modeling.

Lec 11 - Large-Scale Distributed Training

  • Hardware, sharding, checkpointing, parallelism, and scaling recipes for very large models.

Lec 12 - Self-Supervised Learning

  • Pretext tasks, MAE, contrastive learning, SimCLR, MoCo, CPC, and DINO.

Lec 13 - Generative Models (Part 1)

  • Discriminative vs generative modeling, autoregressive models, autoencoders, and VAEs.

Lec 14 - Generative Models (Part 2)

  • GANs, diffusion, latent diffusion, DiT conditioning, and modern text-to-image / text-to-video trends.

Lec 15 - 3D Vision

  • 3D representations, NeRFs, Gaussian splatting, and structure-aware 3D reasoning.

Lec 16 - Vision-Language Models

  • CLIP, CoCa, LLaVA, Flamingo, SAM, and the broader multimodal foundation model landscape.

Lec 17 - Robot Learning

  • RL, model-based control, imitation learning, diffusion policies, and robot foundation models.

Lec 18 - 3D Vision Follow-Up

  • Depth, voxels, point clouds, meshes, NeRF variants, Gaussian splatting, and 3D foundation models.

Lessons

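  • Stage your forward and backward passes: compute the forward pass as a sequence of named intermediate values, then write the backward pass in reverse order with one gradient per intermediate. I did not do this well in my own code, but it would have made the code much more readable later.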
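A small sketch of what staging can look like, using the scalar function f(x, y) = (x + sigmoid(y)) / (sigmoid(x) + (x + y)^2). The function and the variable names are illustrative choices, not code from my assignments; the point is only that each forward intermediate gets a name and each backward line undoes exactly one forward line.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(x, y):
    """Return f(x, y) and its gradients (df/dx, df/dy) via staged backprop."""
    # Forward pass: one named intermediate per operation.
    sigy = sigmoid(y)
    num = x + sigy            # numerator
    sigx = sigmoid(x)
    xpy = x + y
    xpysqr = xpy ** 2
    den = sigx + xpysqr       # denominator
    invden = 1.0 / den
    f = num * invden

    # Backward pass: walk the stages in reverse, one gradient per intermediate.
    dnum = invden                          # f = num * invden
    dinvden = num
    dden = (-1.0 / den ** 2) * dinvden     # invden = 1 / den
    dsigx = dden                           # den = sigx + xpysqr
    dxpysqr = dden
    dxpy = 2.0 * xpy * dxpysqr             # xpysqr = xpy ** 2
    dx = dxpy                              # xpy = x + y
    dy = dxpy
    dx += (1.0 - sigx) * sigx * dsigx      # sigx = sigmoid(x)
    dx += dnum                             # num = x + sigy
    dsigy = dnum
    dy += (1.0 - sigy) * sigy * dsigy      # sigy = sigmoid(y)
    return f, dx, dy

if __name__ == "__main__":
    print(forward_backward(3.0, -4.0))
```

Because x and y each feed several stages, their gradients are accumulated with `+=` rather than overwritten. A numerical gradient check against finite differences is a cheap way to confirm each stage.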

Screenshots

Forward / backward pass reference:

Additional reference: