WATonomous

Action Classification

There is a difference between this and Object Tracking, although the two seem to overlap.

We need action classification to infer intent, for example of pedestrians: are they simply stopped, or do they intend to cross (in which case we should yield the right of way)?

Papers

Video classification (CS231n 2025 Lec 10)

A video is a 4D tensor, T × 3 × H × W (or 3 × T × H × W). Raw video is huge: uncompressed SD at 30 fps is roughly 1.5 GB/min, HD roughly 10 GB/min. The standard fix is to train on short low-res clips (e.g. T ≈ 16 frames at 112 × 112), then ensemble at test time by averaging predictions over multiple clips sampled from the long video.
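A minimal sketch of that test-time recipe, assuming a PyTorch-style model that maps a batch of (3, clip_len, H, W) clips to class logits; the model and the evenly-spaced sampling strategy are placeholders, not from the lecture:

```python
import torch
import torch.nn.functional as F

def classify_video(model, video, clip_len=16, num_clips=10):
    """video: (3, T, H, W) float tensor; model maps (N, 3, clip_len, H, W) -> logits."""
    _, T, _, _ = video.shape
    # Evenly spaced clip start frames across the long video.
    starts = torch.linspace(0, max(T - clip_len, 0), num_clips).long()
    clips = torch.stack([video[:, int(s):int(s) + clip_len] for s in starts])
    with torch.no_grad():
        probs = F.softmax(model(clips), dim=1)   # per-clip class probabilities
    return probs.mean(dim=0)                     # ensemble: average over sampled clips
```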

Action recognition = video classification: input video → label like {Swimming, Running, Jumping, Eating, Standing}. Two extensions: Temporal Action Localization (Chao et al CVPR 2018, untrimmed video → time-bounded action segments — Faster-R-CNN-style temporal proposals + classify) and Spatio-Temporal Detection (AVA dataset, Gu et al CVPR 2018 — bbox in space and time, classify atomic actions like “clink glass → drink”, “grab → hug”).

Baseline architectures

Each fuses time differently:

| baseline | input shape | how time enters | what it learns |
|---|---|---|---|
| Single-Frame CNN | 3 × H × W per frame | none: score frames independently, average | shockingly strong; recognizes scene/objects per frame |
| Late Fusion (FC) | T frames → CNN per frame | concat per-frame features → FC | global temporal mixing at the end |
| Late Fusion (pooling) | same | average per-frame features → linear | parameter-light variant |
| Early Fusion | reshape T × 3 × H × W to 3T × H × W | first 2D conv (filter over all 3T channels) sees all frames at once | temporal info collapsed in layer 1; no temporal shift-invariance |
| 3D CNN ("Slow Fusion") | 3 × T × H × W | every layer is 3D conv + 3D pool | slowly fuses time across depth; temporal shift-invariant because the 3D kernel slides over time |
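A shape-level sketch of the fusion styles in PyTorch; the layer sizes and the tiny stand-in 2D backbone are placeholders chosen only to make the tensor shapes concrete:

```python
import torch
import torch.nn as nn

N, T, H, W, D, C = 2, 16, 112, 112, 512, 10
x = torch.randn(N, T, 3, H, W)

# Stand-in 2D backbone: one conv + global pool -> a D-dim feature per frame.
backbone2d = nn.Sequential(nn.Conv2d(3, D, 7, stride=16), nn.AdaptiveAvgPool2d(1), nn.Flatten())

# Late fusion (pooling): run the 2D CNN on every frame, then average features over time.
feats = backbone2d(x.reshape(N * T, 3, H, W)).reshape(N, T, D)   # (N, T, D)
late_scores = nn.Linear(D, C)(feats.mean(dim=1))                 # (N, C)

# Early fusion: collapse time into channels, so the first 2D conv sees all T frames at once.
early_conv = nn.Conv2d(3 * T, D, kernel_size=3, padding=1)
early_out = early_conv(x.reshape(N, T * 3, H, W))                # (N, D, H, W); time is gone after layer 1

# 3D CNN: keep time as its own axis; the kernel slides over space and time.
conv3d = nn.Conv3d(3, D, kernel_size=3, padding=1)
slow_out = conv3d(x.permute(0, 2, 1, 3, 4))                      # (N, D, T, H, W)
```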

Why 3D wins: a 2D Early-Fusion conv has weights of shape C_out × 3T × K × K, the same kernel applied at every spatial location but with a fixed slot per time index. Shift the action in time and the response changes. A 3D conv (weights C_out × 3 × K_t × K × K) slides over time too, so a temporal shift produces a temporally shifted activation map.
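A quick numerical check of that claim (my own illustration, not from the slides), using circular padding so the temporal shift is exact:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 8, 16, 16)              # (N, 3, T, H, W)
x_shift = torch.roll(x, shifts=1, dims=2)      # circularly shift the "action" in time

conv3d = nn.Conv3d(3, 4, kernel_size=3, padding=1, padding_mode='circular')
y, y_shift = conv3d(x), conv3d(x_shift)
# 3D conv: shifting the input in time just shifts the output in time.
print(torch.allclose(torch.roll(y, 1, dims=2), y_shift, atol=1e-5))   # True

conv2d = nn.Conv2d(3 * 8, 4, kernel_size=3, padding=1)                # early fusion: time in channels
z = conv2d(x.permute(0, 2, 1, 3, 4).reshape(1, 24, 16, 16))
z_shift = conv2d(x_shift.permute(0, 2, 1, 3, 4).reshape(1, 24, 16, 16))
# Early fusion: each time index has its own weight slot, so the response changes.
print(torch.allclose(z, z_shift, atol=1e-5))                          # False
```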

C3D — “VGG of 3D CNNs” (Tran et al ICCV 2015)

All 3×3×3 conv + 2×2×2 pool throughout (Pool1 is 1×2×2 to preserve early temporal resolution). Ends with two FC-4096 layers and a final FC over the class scores. Total: 39.5 GFLOP per 16×112×112 clip, roughly 3× VGG-16's 13.6 GFLOP (AlexNet: 0.7 GFLOP). Trained on Sports-1M (1M YouTube videos, 487 sport categories; Karpathy et al CVPR 2014).
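An abridged C3D-style stack in PyTorch showing the 3×3×3 convs and the 1×2×2 first pool; the channel widths are representative, and the FC-4096 head is replaced by global pooling to keep the sketch short:

```python
import torch.nn as nn

def c3d_sketch(num_classes=487):               # 487 = Sports-1M sport categories
    return nn.Sequential(
        nn.Conv3d(3, 64, 3, padding=1), nn.ReLU(),
        nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)),   # pool1: spatial only, keep T
        nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(),
        nn.MaxPool3d(2, 2),                                      # later pools: halve T, H, W
        nn.Conv3d(128, 256, 3, padding=1), nn.ReLU(),
        nn.MaxPool3d(2, 2),
        nn.AdaptiveAvgPool3d(1), nn.Flatten(),                   # stand-in for the flatten + FC-4096 pair
        nn.Linear(256, num_classes),
    )
```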

Sports-1M Top-5 accuracy: Single-Frame 77.7 → Early Fusion 76.8 → Late Fusion 78.7 → 3D CNN 80.2 → C3D 84.4.

Inflating 2D networks to 3D — I3D (Carreira & Zisserman CVPR 2017, “Quo Vadis”)

Take any 2D image architecture and replace each conv/pool with a 3D version: a 3×3 conv becomes 3×3×3, a 2×2 pool becomes 2×2×2, and so on throughout the Inception modules.

Initialization trick: copy the 2D pretrained kernel K_t times along the new time axis and divide by K_t. On a "constant" video (all frames identical), this 3D conv produces exactly the same output as the original 2D conv on one frame, so an ImageNet-pretrained image net becomes a sensible video init for free.
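A minimal sketch of the inflation trick plus a check of the constant-video property; kernel sizes and function names are my own, not from the paper:

```python
import torch
import torch.nn.functional as F

def inflate_conv_weight(w2d, time_k):
    """w2d: (C_out, C_in, K, K) pretrained 2D kernel -> (C_out, C_in, time_k, K, K)."""
    return w2d.unsqueeze(2).repeat(1, 1, time_k, 1, 1) / time_k

w2d = torch.randn(8, 3, 3, 3)
w3d = inflate_conv_weight(w2d, time_k=5)
frame = torch.randn(1, 3, 32, 32)
video = frame.unsqueeze(2).repeat(1, 1, 5, 1, 1)           # same frame repeated in time
out2d = F.conv2d(frame, w2d, padding=1)
out3d = F.conv3d(video, w3d, padding=(0, 1, 1))
# On a constant video, the inflated 3D conv reproduces the 2D response exactly.
print(torch.allclose(out2d, out3d[:, :, 0], atol=1e-5))    # True (up to float error)
```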

Kinetics-400 results table

Train-from-scratch / ImageNet-pretrained:

| model | scratch | + ImageNet pretrain |
|---|---|---|
| Per-frame CNN | 57.9 | 62.2 |
| CNN + LSTM | 53.9 | 63.3 |
| Two-Stream | 62.8 | 65.6 |
| I3D (inflated 3D CNN) | 68.4 | 71.1 |
| Two-stream I3D | 71.6 | 74.2 |

Post-2017 climb on Kinetics-400: I3D 71.1 → SlowFast (+ Nonlocal) 79.8 → MViTv2-L 86.1 → VideoMAE V2-g 90. The progression mirrors image classification: better backbones (3D-ResNet → SlowFast → ViT-style) win.

Visualizing video models (Feichtenhofer CVPR 2018 / IJCV 2019)

Backprop through a two-stream net with a spatial-smoothness penalty on the optical-flow input, and separate the visualization into Appearance (RGB stream) vs "Slow" motion vs "Fast" motion (different temporal-frequency tunings of the flow stream). For the class "Weightlifting": Appearance shows stacked dumbbells, Slow shows a "bar shaking" pattern, and Fast shows a "push overhead" pattern; different speeds reveal different action sub-phases.
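A rough sketch of the underlying idea, greatly simplified from the actual method: gradient-ascend a flow-like input to maximize one class score under a total-variation smoothness penalty. The model here is a hypothetical flow-stream network mapping a (N, 2, T, H, W) input to class logits; all shapes and hyperparameters are placeholders.

```python
import torch

def visualize_class(model, class_idx, shape=(1, 2, 16, 112, 112),
                    steps=200, lr=0.05, tv_weight=1e-3):
    x = torch.zeros(shape, requires_grad=True)       # optical-flow-like input (2 channels: dx, dy)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        score = model(x)[0, class_idx]
        # Total-variation penalty over spatial neighbours keeps the motion pattern smooth.
        tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
             (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
        loss = -score + tv_weight * tv
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```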

Source

CS231n 2025 Lec 10 slides 1–86 (video tensor format, training on short clips, single-frame / late / early / 3D CNN baselines with shape and shift-invariance comparisons, Sports-1M results, C3D layer table and FLOPs, Recognizing actions from motion, Two-Stream and I3D results on Kinetics-400, visualization slides 79–83, temporal action localization slide 85, AVA spatio-temporal detection slide 86). 2026 PDF not published — using 2025 fallback (May 1, 2025).