WATonomous

Action Classification

There is a difference between this and Object Tracking, although the two seem to overlap.

We need action classification to infer intent, for example of pedestrians: are they simply stopped, or do they intend to cross (in which case we should yield the right of way)?

Papers

Video classification (CS231n 2025 Lec 10)

A video is a 4D tensor, T × 3 × H × W (or 3 × T × H × W). Raw video is huge: uncompressed SD at 30 fps is roughly 1.5 GB/min, HD roughly 10 GB/min. The standard fix is to train on short low-res clips (e.g. T ≈ 16 frames at 112 × 112), then ensemble at test time by averaging predictions over multiple clips sampled from the long video.
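A minimal sketch of that test-time recipe, assuming a PyTorch-style model that maps a batch of (3, clip_len, H, W) clips to class logits; the model and the evenly-spaced sampling strategy are placeholders, not from the lecture:

```python
import torch
import torch.nn.functional as F

def classify_video(model, video, clip_len=16, num_clips=10):
    """video: (3, T, H, W) float tensor; model maps (N, 3, clip_len, H, W) -> logits."""
    _, T, _, _ = video.shape
    # Evenly spaced clip start frames across the long video.
    starts = torch.linspace(0, max(T - clip_len, 0), num_clips).long()
    clips = torch.stack([video[:, int(s):int(s) + clip_len] for s in starts])
    with torch.no_grad():
        probs = F.softmax(model(clips), dim=1)   # per-clip class probabilities
    return probs.mean(dim=0)                     # ensemble: average over sampled clips
```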

Action recognition = video classification: input video → label like {Swimming, Running, Jumping, Eating, Standing}. Two extensions: Temporal Action Localization (Chao et al CVPR 2018, untrimmed video → time-bounded action segments — Faster-R-CNN-style temporal proposals + classify) and Spatio-Temporal Detection (AVA dataset, Gu et al CVPR 2018 — bbox in space and time, classify atomic actions like “clink glass → drink”, “grab → hug”).

Baseline architectures

Each fuses time differently:

| baseline | input shape | how time enters | what it learns |
|---|---|---|---|
| Single-Frame CNN | 3 × H × W per frame | none: score frames independently, average | shockingly strong; recognizes scene/objects per frame |
| Late Fusion (FC) | T frames → CNN per frame | concat per-frame features → FC | global temporal mixing at the end |
| Late Fusion (pooling) | same | average per-frame features → linear | parameter-light variant |
| Early Fusion | reshape T × 3 × H × W to 3T × H × W | first 2D conv (filter over all 3T channels) sees all frames at once | temporal info collapsed in layer 1; no temporal shift-invariance |
| 3D CNN ("Slow Fusion") | 3 × T × H × W | every layer is 3D conv + 3D pool | slowly fuses time across depth; temporal shift-invariant because the 3D kernel slides over time |
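A shape-level sketch of the fusion styles in PyTorch; the layer sizes and the tiny stand-in 2D backbone are placeholders chosen only to make the tensor shapes concrete:

```python
import torch
import torch.nn as nn

N, T, H, W, D, C = 2, 16, 112, 112, 512, 10
x = torch.randn(N, T, 3, H, W)

# Stand-in 2D backbone: one conv + global pool -> a D-dim feature per frame.
backbone2d = nn.Sequential(nn.Conv2d(3, D, 7, stride=16), nn.AdaptiveAvgPool2d(1), nn.Flatten())

# Late fusion (pooling): run the 2D CNN on every frame, then average features over time.
feats = backbone2d(x.reshape(N * T, 3, H, W)).reshape(N, T, D)   # (N, T, D)
late_scores = nn.Linear(D, C)(feats.mean(dim=1))                 # (N, C)

# Early fusion: collapse time into channels, so the first 2D conv sees all T frames at once.
early_conv = nn.Conv2d(3 * T, D, kernel_size=3, padding=1)
early_out = early_conv(x.reshape(N, T * 3, H, W))                # (N, D, H, W); time is gone after layer 1

# 3D CNN: keep time as its own axis; the kernel slides over space and time.
conv3d = nn.Conv3d(3, D, kernel_size=3, padding=1)
slow_out = conv3d(x.permute(0, 2, 1, 3, 4))                      # (N, D, T, H, W)
```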

Why 3D wins: a 2D Early-Fusion conv has weights of shape C_out × 3T × K × K, the same kernel applied at every spatial location but with a fixed slot per time index. Shift the action in time and the response changes. A 3D conv (weights C_out × 3 × K_t × K × K) slides over time too, so a temporal shift produces a temporally shifted activation map.
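A quick numerical check of that claim (my own illustration, not from the slides), using circular padding so the temporal shift is exact:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 8, 16, 16)              # (N, 3, T, H, W)
x_shift = torch.roll(x, shifts=1, dims=2)      # circularly shift the "action" in time

conv3d = nn.Conv3d(3, 4, kernel_size=3, padding=1, padding_mode='circular')
y, y_shift = conv3d(x), conv3d(x_shift)
# 3D conv: shifting the input in time just shifts the output in time.
print(torch.allclose(torch.roll(y, 1, dims=2), y_shift, atol=1e-5))   # True

conv2d = nn.Conv2d(3 * 8, 4, kernel_size=3, padding=1)                # early fusion: time in channels
z = conv2d(x.permute(0, 2, 1, 3, 4).reshape(1, 24, 16, 16))
z_shift = conv2d(x_shift.permute(0, 2, 1, 3, 4).reshape(1, 24, 16, 16))
# Early fusion: each time index has its own weight slot, so the response changes.
print(torch.allclose(z, z_shift, atol=1e-5))                          # False
```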

C3D — “VGG of 3D CNNs” (Tran et al ICCV 2015)

All 3×3×3 conv + 2×2×2 pool throughout (Pool1 is 1×2×2 to preserve early temporal resolution). Ends with two FC-4096 layers and a final FC over the class scores. Total: 39.5 GFLOP per 16×112×112 clip, roughly 3× VGG-16's 13.6 GFLOP (AlexNet: 0.7 GFLOP). Trained on Sports-1M (1M YouTube videos, 487 sport categories; Karpathy et al CVPR 2014).
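An abridged C3D-style stack in PyTorch showing the 3×3×3 convs and the 1×2×2 first pool; the channel widths are representative, and the FC-4096 head is replaced by global pooling to keep the sketch short:

```python
import torch.nn as nn

def c3d_sketch(num_classes=487):               # 487 = Sports-1M sport categories
    return nn.Sequential(
        nn.Conv3d(3, 64, 3, padding=1), nn.ReLU(),
        nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)),   # pool1: spatial only, keep T
        nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(),
        nn.MaxPool3d(2, 2),                                      # later pools: halve T, H, W
        nn.Conv3d(128, 256, 3, padding=1), nn.ReLU(),
        nn.MaxPool3d(2, 2),
        nn.AdaptiveAvgPool3d(1), nn.Flatten(),                   # stand-in for the flatten + FC-4096 pair
        nn.Linear(256, num_classes),
    )
```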

Sports-1M Top-5 accuracy: Single-Frame 77.7 → Early Fusion 76.8 → Late Fusion 78.7 → 3D CNN 80.2 → C3D 84.4.

Inflating 2D networks to 3D — I3D (Carreira & Zisserman CVPR 2017, “Quo Vadis”)

Take any 2D image architecture and replace each conv/pool with a 3D version: a 3×3 conv becomes 3×3×3, a 2×2 pool becomes 2×2×2, and so on throughout the Inception modules.

Initialization trick: copy the 2D pretrained kernel K_t times along the new time axis and divide by K_t. On a "constant" video (all frames identical), this 3D conv produces exactly the same output as the original 2D conv on one frame, so an ImageNet-pretrained image net becomes a sensible video init for free.
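A minimal sketch of the inflation trick plus a check of the constant-video property; kernel sizes and function names are my own, not from the paper:

```python
import torch
import torch.nn.functional as F

def inflate_conv_weight(w2d, time_k):
    """w2d: (C_out, C_in, K, K) pretrained 2D kernel -> (C_out, C_in, time_k, K, K)."""
    return w2d.unsqueeze(2).repeat(1, 1, time_k, 1, 1) / time_k

w2d = torch.randn(8, 3, 3, 3)
w3d = inflate_conv_weight(w2d, time_k=5)
frame = torch.randn(1, 3, 32, 32)
video = frame.unsqueeze(2).repeat(1, 1, 5, 1, 1)           # same frame repeated in time
out2d = F.conv2d(frame, w2d, padding=1)
out3d = F.conv3d(video, w3d, padding=(0, 1, 1))
# On a constant video, the inflated 3D conv reproduces the 2D response exactly.
print(torch.allclose(out2d, out3d[:, :, 0], atol=1e-5))    # True (up to float error)
```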

Kinetics-400 results table

Train-from-scratch / ImageNet-pretrained:

| model | scratch | + ImageNet pretrain |
|---|---|---|
| Per-frame CNN | 57.9 | 62.2 |
| CNN + LSTM | 53.9 | 63.3 |
| Two-Stream | 62.8 | 65.6 |
| I3D (inflated 3D CNN) | 68.4 | 71.1 |
| Two-stream I3D | 71.6 | 74.2 |

Post-2017 climb on Kinetics-400: I3D 71.1 → SlowFast (+ Nonlocal) 79.8 → MViTv2-L 86.1 → VideoMAE V2-g 90. The progression mirrors image classification: better backbones (3D-ResNet → SlowFast → ViT-style) win.

Visualizing video models (Feichtenhofer CVPR 2018 / IJCV 2019)

Backprop through a two-stream net with a spatial-smoothness penalty on the optical-flow input, and separate the visualization into Appearance (RGB stream) vs "Slow" motion vs "Fast" motion (different temporal-frequency tunings of the flow stream). For the class "Weightlifting": Appearance shows stacked dumbbells, Slow shows a "bar shaking" pattern, and Fast shows a "push overhead" pattern; different speeds reveal different action sub-phases.
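A rough sketch of the underlying idea, greatly simplified from the actual method: gradient-ascend a flow-like input to maximize one class score under a total-variation smoothness penalty. The model here is a hypothetical flow-stream network mapping a (N, 2, T, H, W) input to class logits; all shapes and hyperparameters are placeholders.

```python
import torch

def visualize_class(model, class_idx, shape=(1, 2, 16, 112, 112),
                    steps=200, lr=0.05, tv_weight=1e-3):
    x = torch.zeros(shape, requires_grad=True)       # optical-flow-like input (2 channels: dx, dy)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        score = model(x)[0, class_idx]
        # Total-variation penalty over spatial neighbours keeps the motion pattern smooth.
        tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
             (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
        loss = -score + tv_weight * tv
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```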

Source

CS231n 2025 Lec 10 slides 1–86 (video tensor format, training on short clips, single-frame / late / early / 3D CNN baselines with shape and shift-invariance comparisons, Sports-1M results, C3D layer table and FLOPs, Recognizing actions from motion, Two-Stream and I3D results on Kinetics-400, visualization slides 79–83, temporal action localization slide 85, AVA spatio-temporal detection slide 86). 2026 PDF not published — using 2025 fallback (May 1, 2025).