CS231n — Deep Learning for Computer Vision

Stanford’s flagship deep-learning-for-vision course, originally designed by Andrej Karpathy. Light on theory, heavy on intuition built around backprop through computational graphs.

Course page: https://cs231n.stanford.edu/
Schedule (2026): https://cs231n.stanford.edu/schedule.html
Online notes: https://cs231n.github.io/

Lectures

1. Foundations (Lec 1–4)

Lec 1: Course intro, history of CV — biological vision, Hubel & Wiesel, ImageNet
Lec 2: Image Classification — data-driven approach, kNN, hyperparameters & validation, Linear Classifier, Softmax vs SVM / Hinge Loss
Lec 3: Regularization (L1/L2/elastic net) & Optimization — SGD problems (conditioning, saddle points, noise), SGD+Momentum, RMSProp, Adam, AdamW, LR schedules
Lec 4: Neural Networks & Backpropagation — feature transforms / why nonlinearity, computational graphs, gradient-flow patterns (add/mul/copy/max gates), modular forward/backward API, vector & matrix backprop with implicit Jacobians, SiLU activations
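
The Softmax vs SVM / hinge loss contrast from Lec 2 is easier to see on one concrete score vector. Below is a minimal NumPy sketch, assuming a single example with three class scores and a hinge margin of 1; the function names and the example scores are illustrative assumptions, not course code.

```python
import numpy as np

def svm_hinge_loss(scores, y, margin=1.0):
    """Multiclass hinge loss for one example: sum_j max(0, s_j - s_y + margin), j != y."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0  # the correct class does not contribute
    return float(margins.sum())

def softmax_loss(scores, y):
    """Cross-entropy of the softmax distribution: -log p_y."""
    shifted = scores - scores.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[y])

# Illustrative scores for 3 classes; the correct class is index 0.
scores = np.array([3.2, 5.1, -1.7])
hinge = svm_hinge_loss(scores, y=0)
ce = softmax_loss(scores, y=0)
print(f"hinge = {hinge:.2f}, softmax cross-entropy = {ce:.2f}")
```

The hinge loss drops to exactly zero once the correct score beats every other score by the margin, while the softmax loss keeps pushing the correct-class probability toward 1.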

2. CNNs and CNN Architectures (Lec 5–6)

Lec 5: Convolutional Neural Networks — why preserve spatial structure, conv layer shapes ($C_{out} \times C_{in} \times K \times K$), padding/stride formulas, receptive fields ($1 + L (K - 1)$), 1×1 conv, pooling, translation equivariance vs invariance, what filters learn
Lec 6: Training CNNs & CNN Architectures — LayerNorm / BatchNorm / Dropout, SiLU, ImageNet history (AlexNet → VGG → GoogLeNet → ResNet), VGG’s all-3×3 design, ResNet residual block ( $H (x) = F (x) + x$ ), Kaiming init, preprocessing & augmentation, transfer learning (similarity × data-size matrix), hyperparam workflow + loss curve diagnostics
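
The shape arithmetic from Lec 5 can be checked with a few lines of plain Python. This sketch assumes the standard output-size formula $(W - K + 2P) / S + 1$ (the notes only say "padding/stride formulas"), square kernels, and one bias per output channel; the helper names are hypothetical.

```python
def conv_output_size(w, k, stride=1, pad=0):
    """Spatial output size of a conv or pooling layer: (W - K + 2P) // S + 1."""
    return (w - k + 2 * pad) // stride + 1

def conv_num_params(c_in, c_out, k, bias=True):
    """Weights have shape C_out x C_in x K x K, plus one bias per output channel."""
    return c_out * c_in * k * k + (c_out if bias else 0)

def receptive_field(num_layers, k):
    """Receptive field of a stack of L stride-1 K x K convs: 1 + L * (K - 1)."""
    return 1 + num_layers * (k - 1)

# A 32x32 input with a 5x5 kernel, stride 1, pad 2 keeps its spatial size.
print(conv_output_size(32, 5, stride=1, pad=2))
# Three stacked 3x3 convs see the same 7x7 window as a single 7x7 conv.
print(receptive_field(3, 3))
# Parameter count of a 3 -> 6 channel layer with 5x5 kernels.
print(conv_num_params(3, 6, 5))
```

The receptive-field line is the arithmetic behind VGG's all-3×3 design: the same window as one large kernel, with fewer parameters and more nonlinearities.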

3. Sequence models (Lec 7–8)

Lec 7: RNNs, LSTM, GRU — sequence patterns (1-to-many / many-to-1 / many-to-many / seq2seq), vanilla RNN $h_{t} = \tanh (W_{hh} h_{t - 1} + W_{xh} x_{t})$, char-level LM + min-char-rnn, BPTT + truncated BPTT, image captioning (CNN→RNN with $W_{ih} v$ injection), multilayer RNNs, vanilla RNN gradient flow (product of $W_{hh}$'s → vanish/explode), gradient clipping, LSTM (i/f/o/g gates, $c_{t} = f ⊙ c_{t - 1} + i ⊙ g$ uninterrupted gradient path ≈ ResNet), GRU, Seq2Seq encoder-decoder
Lec 8: Attention & Transformers — Bahdanau attention from seq2seq bottleneck, generalized attention layer (scaled dot product, separate K/V), self-attention as a permutation-equivariant set operator, masked SA, multi-head SA as 4 matmuls ( $O (N^{2})$ memory, Flash Attention fix), Transformer block (SA + MLP, residuals, LN, 6 matmuls), scale chart 213M → 175B, LLM recipe (embedding → masked Transformer → projection → softmax), modern tweaks (Pre-Norm, RMSNorm, SwiGLU, MoE), Vision Transformer (16×16 patches → linear projection ≡ strided conv → 2D positional encoding → unmasked self-attention → average pool + classifier)
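
The self-attention layer from Lec 8 fits in a short NumPy sketch. It assumes a single head, no batch dimension, and random projection matrices; the names `self_attention`, `W_q`, `W_k`, `W_v` are illustrative. It also checks the two properties the notes mention: permutation equivariance, and masking of future positions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v, causal=False):
    """Single-head scaled dot-product self-attention over N input vectors."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # (N, N): the O(N^2) memory term
    if causal:
        # Masked self-attention: position i may only attend to positions <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    A = softmax(scores, axis=-1)                 # each row sums to 1
    return A @ V, A

rng = np.random.default_rng(0)
N, D, D_head = 5, 8, 4
X = rng.standard_normal((N, D))
W_q, W_k, W_v = (rng.standard_normal((D, D_head)) for _ in range(3))

Y, A = self_attention(X, W_q, W_k, W_v)
perm = rng.permutation(N)
Y_perm, _ = self_attention(X[perm], W_q, W_k, W_v)   # permute the inputs
Y_causal, A_causal = self_attention(X, W_q, W_k, W_v, causal=True)
print(np.allclose(Y_perm, Y[perm]))  # outputs are permuted the same way
```

Without positional encodings the layer treats its input as a set, which is why the ViT pipeline has to add them explicitly.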

4. Detection, Segmentation, Visualization (Lec 9)

Lec 9: Semantic Segmentation (sliding window → fully convolutional, encoder-decoder with unpooling / transposed conv, U-Net copy-and-crop), Object Detection (single-object multitask loss → multiple-object problem → Selective Search + R-CNN → Fast R-CNN with RoI Pool/Align → Faster R-CNN with RPN/anchors → single-stage YOLO/SSD/RetinaNet → DETR set prediction), instance segmentation (Mask R-CNN = Faster R-CNN + per-class 28×28 mask head, also pose), Feature Visualization (first-layer filters, saliency via backprop, guided backprop, CAM, Grad-CAM, ViT attention)

5. Video, Distributed Training, Self-Supervised (Lec 10–12)

Lec 10: Video understanding — video as 4D tensor ( $T \times 3 \times H \times W$ , train short clips, test-time clip ensemble), 3D CNN (“Slow Fusion”) with shift-invariance comparison, C3D (“VGG of 3D CNNs”, 39.5 GFLOP), Sports-1M, recognizing actions from motion (Johansson 1973), Optical Flow + Two-Stream Networks (temporal stream beats spatial on UCF-101), CNN+LSTM combos and Recurrent Convolutional Network (Ballas — replace matmul with 2D conv), Nonlocal Block (Wang CVPR 2018, drop-in to 3D CNNs via 1×1×1 Q/K/V convs), I3D — Inflating 2D Networks to 3D (Carreira & Zisserman, copy-and-divide-by- $K_{t}$ init trick), Video Transformers (factorized attention ViViT/TimeSformer, pooling MViTv2, masked autoencoders VideoMAE/V2), Kinetics-400 climb I3D 71.1 → SlowFast+NL 79.8 → MViTv2-L 86.1 → VideoMAE V2-g 90, visualizing video models (Appearance vs Slow vs Fast motion), Temporal Action Localization (Faster-R-CNN-style temporal proposals), Spatio-Temporal Detection (AVA dataset), audio-visual video understanding (McGurk, visually-guided audio source separation), efficient video (MoViNets, X3D, SCSampler, AdaMML, Listen to Look), egocentric (Project Aria), Video + LLMs (Video-LLaVA, Video-ChatGPT, VideoLLaMA 3)
Lec 11: Large-scale distributed training — H100 hardware (HBM 80GB/3352 GB/s, 132 SMs, Tensor Cores 4096 FLOP/cycle), K40→B200 FLOPs timeline, Data Parallelism → FSDP (ZeRO sharding of params/grads/Adam states, all_gather + reduce_scatter per layer) → HSDP (intra-node FSDP + inter-node DP), Llama3-405B memory arithmetic (800GB → 10GB/GPU), activation checkpointing ( $N$ -optimum compute/memory tradeoff), HFU vs MFU (>30% good, >40% excellent; GPT-3 21%, PaLM 46%), ND parallelism over (Batch, Seq, Dim) = DP/CP/PP/TP, Context Parallelism (Ring Attention, DeepSpeed Ulysses), Pipeline Parallelism (GPipe bubble + microbatches, $NM / (NM + N - 1)$ active fraction), Tensor Parallelism (column/row sharding two-layer no-comm trick), Llama3-405B 4D recipe table (8K→16K GPUs, seq 8192→131072, MFU 43%→41%→38%)
Lec 12: Self-Supervised Learning — pretext tasks (rotation prediction / Gidaris 2018, relative patch location / Doersch 2015, jigsaw / Noroozi 2016, inpainting via Context Encoders / Pathak 2016, colorization / split-brain autoencoder Zhang 2017, video coloring Vondrick 2018), Masked Autoencoders (MAE) (75% masking, asymmetric encoder on visible-only patches + lightweight decoder, MSE on masked patches, ViT-H 448 → 87.8% ImageNet), contrastive learning framework (attract $x / x^{+}$, repel $x^{-}$, InfoNCE $L = - E \left[ \log \frac{\exp s (f (x), f (x^{+}))}{\exp s (f (x), f (x^{+})) + \sum_{j} \exp s (f (x), f (x_{j}^{-}))} \right]$ as $N$-way softmax, MI lower bound $MI [f (x), f (x^{+})] - \log N \geq - L$), SimCLR (2$N$ batch, cosine similarity affinity matrix, non-linear projection $g (\cdot)$ thrown away post-train, large batch crucial — 8192 on TPU), MoCo (FIFO queue of negatives + momentum encoder $θ_{k} \leftarrow m θ_{k} + (1 - m) θ_{q}$, decouples batch size from number of negatives, MoCo-v2 = MoCo queue + SimCLR MLP head + strong aug, 67.5% vs SimCLR 66.6% at 256-batch), instance-level (SimCLR/MoCo) vs sequence-level contrastive, CPC (van den Oord 2018 — encode $z_{t} = g_{enc} (x_{t})$, summarize context $c_{t} = g_{ar} (z_{\leq t})$ via GRU-RNN, InfoNCE with time-dependent score $s_{k} (z_{t + k}, c_{t}) = z_{t + k}^{T} W_{k} c_{t}$, applies to audio / image patches), DINO (self-distillation, student $g_{θ_{s}}$ + teacher $g_{θ_{t}}$ from EMA, cross-entropy $- p_{2} \log p_{1}$ with teacher centering + sharpening to prevent collapse, ViT 8×8 patches → emergent unsupervised object segmentation)
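
The InfoNCE loss and the MoCo momentum update from Lec 12 can be sketched for a single query. This assumes cosine similarity as the score $s(\cdot, \cdot)$ (as in SimCLR) and a temperature of 0.1, which is an illustrative choice rather than something stated in the notes; the function names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce(f_x, f_pos, f_negs, temperature=0.1):
    """InfoNCE for one query: an N-way softmax whose correct class is the positive."""
    logits = np.array([cosine(f_x, f_pos)] + [cosine(f_x, n) for n in f_negs])
    logits = logits / temperature
    logits = logits - logits.max()                      # numerical stability
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_pos)

def momentum_update(theta_k, theta_q, m=0.999):
    """MoCo key-encoder update: theta_k <- m * theta_k + (1 - m) * theta_q."""
    return m * theta_k + (1.0 - m) * theta_q

f_x = np.array([1.0, 0.0])
# Positive aligned with the query, negatives elsewhere: loss is close to 0.
loss_good = info_nce(f_x, np.array([1.0, 0.0]),
                     [np.array([0.0, 1.0]), np.array([-1.0, 0.0])])
# Positive pointing away from the query while a negative matches it: large loss.
loss_bad = info_nce(f_x, np.array([-1.0, 0.0]),
                    [np.array([0.0, 1.0]), np.array([1.0, 0.0])])
# All candidates equally similar: the loss equals log N with N = 3 candidates.
loss_uniform = info_nce(f_x, f_x, [f_x, f_x])
print(loss_good, loss_bad, loss_uniform)
```

The uniform case shows where the $\log N$ in the mutual-information bound comes from: an encoder that cannot tell the candidates apart pays exactly $\log N$.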

6. Generative Models (Lec 13–14)

Lec 13: Generative Models (part 1) — discriminative $p (y ∣ x)$ vs generative $p (x)$ vs conditional generative $p (x ∣ y)$, density-normalization $\int p (x) \, dx = 1$ (values of $x$ compete for mass), Goodfellow 2017 taxonomy (Explicit: Tractable=Autoregressive / Approximate=VAE; Implicit: Direct=GAN / Indirect=Diffusion), Autoregressive MLE $W^{*} = \arg\max_{W} \sum_{i} \log f (x^{(i)}, W)$ via chain rule $p (x) = \prod_{t} p (x_{t} ∣ x_{< t})$ — LLMs are autoregressive, PixelCNN (scanline-order 8-bit subpixels as 256-way softmax classification, exact $p (x)$, but 1024×1024 RGB = 3M sequential steps), (non-variational) autoencoder L2 reconstruction for feature learning and its generative failure (generating new $z$ is no easier than generating $x$), VAE fix — force $z$ from known prior $N (0, I)$, encoder $q_{ϕ} (z ∣ x) = N (μ_{z ∣ x}, Σ_{z ∣ x})$ + decoder $p_{θ} (x ∣ z) = N (μ_{x ∣ z}, σ^{2})$ where Gaussian-decoder log-likelihood reduces to L2, ELBO derivation (Bayes rule → multiply top/bottom by $q_{ϕ}$ → three log terms → wrap in expectation → two KLs → drop the intractable posterior-KL $\geq 0$), training objective $E_{q_{ϕ}} [\log p_{θ} (x ∣ z)] - D_{KL} (q_{ϕ} (z ∣ x) ∥ p (z))$ with reparam $z = μ + ϵ ⊙ Σ^{1/2}$, the two losses fight (reconstruction wants $Σ \to 0$ + unique $μ$ per $x$; prior wants $Σ = I, μ = 0$), sampling $z \sim N (0, I)$ → decoder, disentangling via diagonal prior (Kingma & Welling MNIST grid — walking $z_{1}$ / $z_{2}$ smoothly traces digit identity and style)
Lec 14: Generative Models (part 2) — GAN minimax $\min_{G} \max_{D} E_{p_{data}} [\log D (x)] + E_{p (z)} [\log (1 - D (G (z)))]$, alternating SGD with no single loss curve, saturation problem at start ($D (G (z)) \approx 0$ → flat $\log (1 - D)$ gradient) and the non-saturating fix (train $G$ to maximize $\log D (G (z))$), optimal $D_{G}^{*} (x) = p_{data} (x) / (p_{data} (x) + p_{G} (x))$ → outer min achieves $p_{G} = p_{data}$, DC-GAN (Radford ICLR 2016, who later did GPT-1/2), StyleGAN AdaIN $w_{i} \cdot (x_{i} - μ) / σ + b_{i}$ with mapping + synthesis networks, latent-interpolation morphs, GAN era 2016–2021; Diffusion introduced via modern Rectified Flow ($x_{t} = (1 - t) x + t z$, $v_{gt} = z - x$, train $L = ∥ f_{θ} (x_{t}, t) - v_{gt} ∥^{2}$ — entire training loop is 5 lines, sampling is Euler-step backward $T \approx 50$ steps), Classifier-Free Guidance (Ho & Salimans 2022 — randomly drop $y \to y_{\emptyset}$ during training so the same net is conditional + unconditional, at sampling combine $v^{cfg} = (1 + w) v^{y} - w v^{\emptyset}$ to extrapolate toward $p (x ∣ y)$, doubles sampling cost), noise schedule trick (middle $t$ is hardest because of $(x, z)$ ambiguity → use logit-normal $t = \mathrm{sigmoid} (\mathrm{randn})$), LDMs (compress to $32 \times 32 \times 16$ latents via VAE + GAN encoder/decoder — small KL weight + discriminator fixes blurry VAE decoder, then DiT denoises latents), DiT conditioning via predicted scale/shift (adaLN-Zero) or cross-attention, T2I (FLUX.1 — T5+CLIP, 12B DiT, 1024 tokens), T2V (Meta MovieGen — 30B DiT, 76K tokens, $T \times H \times W \times 3$), 2024 video-diffusion explosion (Sora/Gen3/MovieGen/Veo 2/Wan/Cosmos/Kling), distillation collapses 30–50 steps to 1, generalized diffusion template $x_{t} = a (t) x + b (t) z$ unifies Rectified Flow / Variance-Preserving / Variance-Exploding and x/ε/v-prediction targets, score-function view $s (x) = \nabla_{x} \log p (x)$ + reverse-SDE view + AR-strikes-back via discrete latents (VQ-VAE + AR Transformer)
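
The Rectified Flow recipe from Lec 14 is short enough to sketch without a deep-learning framework. The code below builds one training example, runs the Euler sampler, and combines velocities for classifier-free guidance. To keep it runnable in NumPy it makes a strong assumption: instead of a trained network $f_{θ}$, it uses a hand-written `oracle` velocity field that is exact only when the dataset is a single point `x0`. All function names are hypothetical.

```python
import numpy as np

def rf_training_pair(x, rng):
    """Build one training example: noisy input x_t, time t, target velocity v_gt."""
    z = rng.standard_normal(x.shape)                   # z ~ N(0, I)
    t = 1.0 / (1.0 + np.exp(-rng.standard_normal()))   # logit-normal t = sigmoid(randn)
    x_t = (1.0 - t) * x + t * z                        # point on the line from data to noise
    v_gt = z - x                                       # velocity pointing from data to noise
    return x_t, t, v_gt

def rf_loss(f_theta, x, rng):
    """Squared error between the predicted and ground-truth velocity."""
    x_t, t, v_gt = rf_training_pair(x, rng)
    return float(np.sum((f_theta(x_t, t) - v_gt) ** 2))

def euler_sample(f_theta, shape, rng, num_steps=50):
    """Start from noise at t = 1 and take Euler steps backward to t = 0."""
    x = rng.standard_normal(shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * f_theta(x, t)
    return x

def cfg_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance: v_cfg = (1 + w) * v_cond - w * v_uncond."""
    return (1.0 + w) * v_cond - w * v_uncond

# Toy setting (assumption): the dataset is the single point x0, for which the
# optimal velocity field has the closed form (x_t - x0) / t.
x0 = np.array([2.0, -1.0])

def oracle(x_t, t):
    return (x_t - x0) / t

rng = np.random.default_rng(0)
loss = rf_loss(oracle, x0, rng)               # ~0: the oracle matches v_gt
sample = euler_sample(oracle, x0.shape, rng)  # lands on x0
print(loss, sample)
```

In a real setup `oracle` would be a network trained by gradient descent on `rf_loss` over minibatches, and `cfg_velocity` would be applied inside the sampling loop, at the cost of two forward passes per step.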

7. 3D, Vision+Language, Robotics, HCAI (Lec 15–18)

Lec 15: 3D Vision (Jiajun Wu) — four-quadrant taxonomy (explicit/implicit × parametric/non-parametric), explicit vs implicit surfaces (torus $f (u, v)$ vs sphere $x^{2} + y^{2} + z^{2} - 1 = 0$ ), level sets, CSG, SDF blending, datasets survey (ShapeNet 3M → ShapeNetCore 51K → Objaverse-XL 10M, CO3D, PartNet, ScanNet), task zoo ( $P (S)$ generative vs $P (c ∣ S)$ discriminative), pipelines by representation: Multi-View CNN (Su ICCV 2015 — max-pool over rendered views, ~90% ModelNet40), voxel nets (3D ShapeNets / 3D-GAN / Visual Object Networks — differentiable projection for shape+texture edits) with octree sparsification (OctNet, O-CNN, OGN), PointNet (permutation + sampling invariance → $γ \circ g \circ h$ with shared MLP + max pool; Chamfer + EMD for point-cloud losses; EdgeConv graph extension), AtlasNet (Groueix CVPR 2018 — $K$ MLPs $R^{2} \to R^{3}$ parameterize patches), deep implicit functions (Occupancy Networks Mescheder CVPR 2019, DeepSDF Park CVPR 2019, LDIF Genova CVPR 2020 with local ellipsoid elements), NeRF Mildenhall ECCV 2020 ( $F_{Θ} (x, y, z, θ, ϕ) \to (RGB, σ)$ , volume rendering $c \approx \sum T_{i} α_{i} c_{i}$ with $T_{i} = \prod (1 - α_{j})$ ), 3D Gaussian Splatting Kerbl SIGGRAPH 2023 (sparse explicit Gaussian blobs, 137 FPS vs NeRF 0.07, ~2000× faster at comparable quality), structure-aware reps (part sets → relationship graphs → hierarchies → StructureNet hierarchical graphs Mo 2019 → programs)
Lec 16: Vision-Language Models / Multi-Modal Foundation Models (Ranjay Krishna) — foundation-model taxonomy (Language: ELMo/BERT/GPT/T5; Classification: CLIP/CoCa; LM+Vision: LLaVA/Flamingo/GPT-4V/Gemini/Molmo; And More!: SAM/Whisper/DALL-E/Stable Diffusion/Imagen; Chaining: CuPL/VisProg), CLIP symmetric InfoNCE on 400M pairs + zero-shot via text-encoder-as-classifier + prompt eng (+1.3% “A photo of a X”, +5% multi-prompt mean) + OOD wins (Adversarial 2.7→77.1, Rendition 37.7→88.9 vs ResNet101 same 76.2 on ImageNet) at 307M vs 44.5M params, CLIP disadvantages (batch-size dep for fine-grained “Welsh Corgi”@32K, compositionality fails on Winoground/CREPE/ARO/SugarCREPE, NegCLIP hard-positive collapse, image-level captions too coarse, CSAM in 5B datasets), CoCa adds Multimodal Text Decoder + captioning loss → 86.3% zero-shot / 91.0% finetuned, LLaVA (CLIP penultimate ViT layer not CLS → linear bridge → LLaMA, 3-stage recipe: init frozen + train linear + finetune both, >100K GPT-4-generated instruction tuples), Flamingo (frozen NFNet + frozen Chinchilla + Perceiver Resampler downsamples to fixed visual tokens + GATED XATTN-DENSE between LM blocks with tanh(alpha) gates init at 0 so frozen LM preserved at step 0, interleaved <image><eos> with mask-to-most-recent → in-context few-shot), Molmo (fully open Sep 2024: weights+data+code+evals; PixMo 700K via spoken 60-90s annotations vs LLaMA 3.1V’s 6B; outputs grounded <point x= y= alt=>; Elo 1076 = 2nd behind GPT-4o 1079; chains with SAM 2), SAM (heavy image encoder + light prompt encoder for points/box/mask/text + lightweight mask decoder, ambiguity → 3 valid masks + confidence with loss only on best match, SA-1B = 1B masks/11M images = 6×/400× OpenImages built via data-engine flywheel, zero-shot on bacteria/Van Gogh/produce), CuPL (GPT-3 “What does a {class} look like?” → CLIP, +0.65 ImageNet, +3.7 DTD, collapses 80→3 prompts), VisProg (Gupta 2023 — GPT writes Python that calls Loc/FaceDet/Seg/Select/Classify/Vqa/Replace/ColorPop/BgBlur/Tag/Emoji/Crop/List/Eval modules from in-context examples)
Lec 17: Robot Learning (Yunzhu Li, May 29 2025) — 7-section survey: problem formulation as agent ↔ physical world with (state, action, reward) + casts of cart-pole/locomotion/Atari/Go/text-gen/chatbot/cloth-folding into the same loop; embodied / active / situated robot perception vs CV; RL structurally differs from SL via stochasticity / credit assignment / nondifferentiable / nonstationary, DQN Atari pipeline (Conv 4→16 8×8/s4 → Conv 16→32 4×4/s2 → FC-256 → FC-A on 4×84×84), DeepMind game milestones (AlphaGo Jan 2016 → AlphaGo Zero Oct 2017 → AlphaZero Dec 2018 → MuZero Nov 2019 + AlphaStar Vinyals Nature 2019 + OpenAI Five Apr 2019), real-robot RL (ETH RSL Sci Robotics 2020, Unitree B2-W Dec 2024, OpenAI Rubik’s Cube 2019, Visual Dexterity Sci Robotics 2023), model-free is sample-inefficient (“3000 years in 40 days”) motivating model-based; model learning + receding-horizon planning with key choice of $s_{t}$ form: pixel dynamics (Deep Visual Foresight Finn & Levine ICRA 2017, CDNA conv+LSTM), keypoint dynamics (Manuelli/Li/Florence/Tedrake CoRL 2020 KUKA pushing Cheez-It), particle dynamics (Wang RSS 2023, RoboCook CoRL 2023 Best Systems — granola/rice/dough piles via GNN rollout $(s_{0}, a_{0}) \to s_{1} \to \dots$); imitation learning flavors — BC (distribution shift / “no data on how to recover” car illustration), DAgger (iterative expert correction), IRL (RL: env+reward→behavior; IRL: env+behavior→reward), Implicit BC ($\arg\min_{a} E_{θ} (o, a)$ for multi-modal actions), Diffusion Policy ($ε_{θ} (o, a)$ over $K$ iterations + action chunking for receding-horizon, beats LSTM-GMM/IBC/BET on multi-modality+commitment); robotic foundation models / LBMs = policy mapping (obs/state, goal) → action without explicit state/transition (RT-1 Dec 2022 → RT-2 Jul 2023 → RT-X Oct 2023 1M episodes/22 embodiments/527 skills/311 scenes/34 labs → OpenVLA Jun 2024 = Llama2-7B + DINOv2 + SigLIP + 7-DoF de-tokenizer → π-Zero Oct 2024 by Physical Intelligence with cross-embodiment dataset + zero-shot/specialized/efficient post-training tracks, open-sourced as openpi Feb 4 2025 → Helix Figure / Hi-Robot PI / Gemini Robotics / GR00T Nvidia / DYNA-1); remaining challenges — eval is costly+noisy with weak training-loss correlation (ALOHA 2 fleet) and sim-to-real gap (no “ImageNet of embodied AI”; BEHAVIOR / Habitat 3.0 candidates), foundation policy ↔ foundation world model (action-conditioned future prediction, DayDreamer / Nvidia Cosmos / 1X), VLM/LLM not tailored for embodiment (“RL from human feedback” → “RL from embodied feedback”; SAM/DINOv2 closer than GPT), adaptation/lifelong (BEHAVIOR-1K preference rank), system-level: Helix two-system 7B-VLM-at-7-9Hz GPU2 + 80M-Transformer-at-200Hz GPU1 vs Hi-Robot high-VLM emits low-level language commands to low-VLA
Lec 18: 3D Vision (slides credit Justin Johnson, presented by Fei-Fei Li / Ehsan Adeli / Chen Wang, Jun 4, 2024 — 2024 deck used as substitute since 2025 Human-Centered AI deck was not posted) — Recall 2D detection/segmentation hierarchy + video as 4D tensor; Multi-View CNN (Su ICCV 2015 — render N views → shared CNN1 → element-wise max-pool over views → CNN2 → softmax, ~90% ModelNet40); 5-rep taxonomy (depth map, voxel grid, pointcloud, mesh, implicit surface); 2.5D — depth maps (Eigen+Fergus ICCV 2015, scale-invariant loss $D (y, y^{*}) = \frac{1}{2 n^{2}} \sum_{i, j} (\log y_{i} - \log y_{j} - \log y_{i}^{*} + \log y_{j}^{*})^{2}$ to handle scale ambiguity) + surface normals (cosine loss $(x \cdot y) / (∥ x ∥∥ y ∥)$); voxels — 3D ShapeNets pipeline ($1 \times 30^{3}$ → 6³/5³/4³ conv → FC → class), 3D-R2N2 (Choy ECCV 2016, 2D CNN + 3D CNN decoder + per-voxel CE), $1024^{3}$ float32 = 4 GB memory wall, OGN octree (Tatarchenko ICCV 2017) at $32^{3} / 64^{3} / 128^{3}$; pointclouds — Point Set Generation (Fan CVPR 2017, FC + conv heads with Chamfer loss $d_{CD} = \sum_{x \in S_{1}} \min_{y \in S_{2}} ∥ x - y ∥^{2} + \sum_{y \in S_{2}} \min_{x \in S_{1}} ∥ x - y ∥^{2}$), PointNet applications (classification / semantic seg / part seg), DenseFusion (Wang CVPR 2019, RGB CNN per-pixel feat + PointNet per-point feat → project & concat → 6D pose); Triangle meshes — Pixel2Mesh (Wang ECCV 2018: ellipsoid template → iterative refinement 156→628→2466 verts, graph conv $f_{i}' = W_{0} f_{i} + \sum_{j \in N (i)} W_{1} f_{j}$, vertex-aligned features via bilinear-sampled CNN feats conv3_3/4_3/5_3, Chamfer loss on mesh→pointcloud sampling), Mesh R-CNN (Gkioxari ICCV 2019, mesh head on Mask R-CNN); implicit (Ren Ng CS184/284A slides) — algebraic surfaces (zero set of poly), CSG ($\cup / \cap / ∖$ trees on primitives), level sets (grid + trilinear interp where $f (x) = 0$), DeepSDF (Park CVPR 2019); NeRF variants — Nerfies (Park ICCV 2021 deformable), RawNeRF (Mildenhall CVPR 2022 HDR), BlockNeRF (Tancik CVPR 2022 SF tiling), cost: 1-2 days V100 train + 14.6M MLP forwards per $256^{2}$ render at 224 samples/pixel; 3D Gaussian Splatting vs NeRF (continuous MLP-along-ray vs blend-discrete-Gaussians-along-ray, hours→minutes fitting + 10s/frame→real-time render), Dynamic 3D Gaussians (Luiten 3DV 2024) + Gaussian Splatting SLAM (Matsuki CVPR 2024); foundation models for 3D — DreamFusion (Poole arXiv 2022, Score Distillation Sampling: optimize NeRF so renders match 2D text-to-image diffusion model), CAT3D (Gao arXiv 2024, multi-view diffusion → fit 3D)
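
Two of the 3D formulas above are easy to sketch in NumPy: the Chamfer loss between point sets (Lec 15 and 18) and the NeRF volume-rendering sum $c \approx \sum T_{i} α_{i} c_{i}$ (Lec 15). For the opacity, the sketch assumes the NeRF paper's $α_{i} = 1 - \exp(-σ_{i} δ_{i})$ with sample spacing $δ_{i}$, which the notes do not spell out; the function names and toy inputs are illustrative.

```python
import numpy as np

def chamfer_distance(S1, S2):
    """Sum of squared nearest-neighbour distances in both directions."""
    # d2[i, j] = squared distance between S1[i] and S2[j]
    d2 = np.sum((S1[:, None, :] - S2[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).sum() + d2.min(axis=0).sum())

def volume_render(sigmas, colors, deltas):
    """Composite samples along one ray: c = sum_i T_i * alpha_i * c_i."""
    alphas = 1.0 - np.exp(-sigmas * deltas)   # opacity of each sample (assumed form)
    # T_i = prod_{j < i} (1 - alpha_j): how much light survives up to sample i
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas
    return weights @ colors, weights

# Chamfer: one point against two points at squared distances 1 and 4.
S1 = np.array([[0.0, 0.0, 0.0]])
S2 = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
cd = chamfer_distance(S1, S2)   # 1 (S1 -> S2) + 1 + 4 (S2 -> S1) = 6

# Volume rendering: an opaque first sample hides everything behind it.
sigmas = np.array([1e9, 1.0, 1.0])
deltas = np.full(3, 0.1)
colors = np.eye(3)              # red, green, blue samples along the ray
pixel, weights = volume_render(sigmas, colors, deltas)
print(cd, pixel)
```

Both functions are permutation-invariant in the right sense: Chamfer ignores the ordering of points within each set, while volume rendering depends on the front-to-back order along the ray.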

Lessons

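Stage your forward and backward passes: break the forward pass into small named intermediate steps, then backprop through them one at a time in reverse order (I actually didn’t do this very well). I think this will make for more readable code in the future.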
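
As a rough illustration of what staging looks like, here is a sketch on a small scalar function, $f(x, y) = (x + σ(y)) / (σ(x) + (x + y)^{2})$, chosen only as an example. Each forward stage gets a name, the backward pass walks the same stages in reverse, and a numerical gradient confirms the result. Function and variable names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def f_forward(x, y):
    """Forward pass in small named stages; returns the output and a cache."""
    sigy = sigmoid(y)        # (1)
    num = x + sigy           # (2) numerator
    sigx = sigmoid(x)        # (3)
    xpy = x + y              # (4)
    xpysqr = xpy ** 2        # (5)
    den = sigx + xpysqr      # (6) denominator
    invden = 1.0 / den       # (7)
    f = num * invden         # (8)
    return f, (sigy, num, sigx, xpy, den, invden)

def f_backward(cache):
    """Backward pass: one line (or two) per forward stage, in reverse order."""
    sigy, num, sigx, xpy, den, invden = cache
    dnum = invden                      # (8) multiply gate swaps its inputs
    dinvden = num                      # (8)
    dden = (-1.0 / den ** 2) * dinvden # (7)
    dsigx = dden                       # (6) add gate distributes the gradient
    dxpysqr = dden                     # (6)
    dxpy = 2.0 * xpy * dxpysqr         # (5)
    dx = dxpy                          # (4)
    dy = dxpy                          # (4)
    dx += (1.0 - sigx) * sigx * dsigx  # (3) gradients accumulate with +=
    dx += dnum                         # (2)
    dsigy = dnum                       # (2)
    dy += (1.0 - sigy) * sigy * dsigy  # (1)
    return dx, dy

x, y = 3.0, -4.0
f, cache = f_forward(x, y)
dx, dy = f_backward(cache)

# Numerical check with central differences.
h = 1e-5
dx_num = (f_forward(x + h, y)[0] - f_forward(x - h, y)[0]) / (2 * h)
dy_num = (f_forward(x, y + h)[0] - f_forward(x, y - h)[0]) / (2 * h)
print(dx, dx_num, dy, dy_num)
```

Because every backward line mirrors exactly one forward line, a wrong gradient can be traced to a single stage instead of one large derivative expression.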

🛠️ Steven Gong
