Feature Visualization
What does a CNN actually look at? A grab-bag of techniques for inspecting trained CNNs: where in the image the model attends, what each neuron responds to, which pixels swing the prediction.
Why care?
A trained classifier is a black box that outputs a class label. Feature visualization opens it up: it reveals failure modes (the network classifies "wolf" because of the snowy background, not the animal), validates that the model learned semantically meaningful features, and powers downstream tools like weakly-supervised localization (object localization for free, just from classification labels).
Tier 1: Looking at weights directly
First-layer filters
The first conv layer's filters have shape $K \times K \times 3$, so they are directly visualizable as RGB images. For AlexNet (96 filters, 11×11×3), ResNet-18/101 (64 filters, 7×7×3), and DenseNet-121 (64 filters, 7×7×3), the learned filters look almost identical: oriented edges, opposing colors, frequency-tuned blobs. Strong evidence that low-level vision is a converged problem.
Doesn't generalize past the first layer: deeper conv weights operate in feature space (not pixel space), so direct visualization is uninformative.
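A small numpy helper for tiling a first-layer weight tensor into one displayable image. The shape (64, 3, 7, 7) below mirrors a ResNet-style conv1 but the values are random placeholders; with a real model you would pass a detached copy of the first conv layer's weights instead.

```python
import numpy as np

def filter_grid(w, pad=1):
    """Tile conv filters of shape (N, 3, k, k) into one (H, W, 3) image in [0, 1]."""
    n, c, k, _ = w.shape
    cols = int(np.ceil(np.sqrt(n)))
    rows = int(np.ceil(n / cols))
    # White background; filters separated by `pad` pixels.
    grid = np.ones((rows * (k + pad) - pad, cols * (k + pad) - pad, c))
    for i in range(n):
        f = w[i].transpose(1, 2, 0)                      # (k, k, 3) for display
        f = (f - f.min()) / (f.max() - f.min() + 1e-8)   # per-filter min-max -> [0, 1]
        r, col = divmod(i, cols)
        grid[r*(k+pad):r*(k+pad)+k, col*(k+pad):col*(k+pad)+k] = f
    return grid

# Random weights stand in for model.conv1.weight here.
demo = filter_grid(np.random.default_rng(0).normal(size=(64, 3, 7, 7)))
```

The per-filter min-max normalization is what makes the edge/color structure visible: raw conv weights are small and signed, so without rescaling the image would be near-uniform gray.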
Tier 2: Saliency via backprop (Simonyan, Vedaldi, Zisserman, ICLR Workshop 2014)
Question: which input pixels matter most for the predicted class?
Recipe:
- Forward pass to get the class score (use the unnormalized score, not the softmax probability; the softmax gradient mixes in the other classes, so the probability can rise by suppressing competitors rather than boosting the target).
- Backprop the score to the image pixels to get the gradient of the score with respect to each pixel.
- Take the absolute value, then max over the RGB channels → a 2D saliency map.
The bright spots show pixels that, if perturbed, would most change the score. Rough-but-cheap object localization falls out of a classifier trained only with image-level labels.
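A toy numpy sketch of the recipe, using a linear "classifier" so the pixel gradient has a closed form; with a real CNN, autograd would supply the gradient and the last two lines would be identical.

```python
import numpy as np

# Toy stand-in for the network: a linear class score S_c(x) = <W_c, x>,
# whose gradient w.r.t. the input is simply W_c.
rng = np.random.default_rng(0)
H, W, C = 8, 8, 3
x   = rng.normal(size=(H, W, C))      # input "image"
W_c = rng.normal(size=(H, W, C))      # weights of the unnormalized class score

score = float((W_c * x).sum())        # forward: raw class score, NOT softmax prob
grad  = W_c                           # backward: dS_c / dx for the linear model

saliency = np.abs(grad).max(axis=-1)  # |gradient|, max over RGB -> (H, W) heatmap
```

The final map has one nonnegative value per pixel, exactly the 2D saliency map described above.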
Guided Backprop (Springenberg et al. ICLR Workshop 2015)
Standard backprop through a ReLU passes gradient only where the forward activation was positive. Guided backprop adds a second filter: also zero out gradients where the gradient itself is negative. So only positive gradients flowing through positive activations get through.
Input           Forward (ReLU)    Backward (standard)   Backward (guided)
[ 1 -1  5]      [1 0 5]           [-2  0 -1]            [0 0 0]
[ 2 -5 -7]  →   [2 0 0]           [ 6  0  0]            [6 0 0]
[-3  2  4]      [0 2 4]           [ 0 -1  3]            [0 0 3]
Visually much cleaner: produces sharp, recognizable visualizations of what each intermediate neuron "looks for".
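The two-mask rule is easy to check numerically. A numpy sketch reproducing the worked example above (the upstream-gradient values at the masked positions are made up, since only the post-mask result appears in the table):

```python
import numpy as np

x = np.array([[ 1, -1,  5],
              [ 2, -5, -7],
              [-3,  2,  4]], dtype=float)   # pre-ReLU activations (forward)
g = np.array([[-2,  7, -1],
              [ 6,  8,  9],
              [ 5, -1,  3]], dtype=float)   # gradient arriving from the layer above

relu_mask  = (x > 0)                 # standard ReLU backward: pass where act > 0
g_standard = g * relu_mask           # -> [[-2, 0, -1], [6, 0, 0], [0, -1, 3]]
g_guided   = g_standard * (g > 0)    # -> [[ 0, 0,  0], [6, 0, 0], [0,  0, 3]]
```

Both masks are elementwise, so guided backprop costs essentially nothing extra over a normal backward pass.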
Tier 3: Class Activation Mapping (CAM; Zhou et al. CVPR 2016)
CAM only works on architectures that end with Global Average Pool → single FC → softmax (e.g. ResNet, GoogLeNet). The trick exploits that GAP commutes with the FC layer.
Setup. The last conv layer outputs features $f_k(x, y)$ (channel $k$, spatial location $(x, y)$); GAP followed by the FC layer gives the class score

$$S_c = \sum_k w_k^c \sum_{x,y} f_k(x, y)$$

The class activation map is the inner sum before the spatial pooling:

$$M_c(x, y) = \sum_k w_k^c f_k(x, y), \qquad \text{so} \qquad S_c = \sum_{x,y} M_c(x, y)$$
Up-sample $M_c$ to image resolution and overlay as a heatmap: "for class $c$, here's where in the image the evidence came from". Discriminative localization with no localization labels.
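A minimal numpy sketch of the CAM computation. Shapes and values are random placeholders; GAP is implemented as a mean here (the sum version in the derivation differs only by the constant number of spatial locations), so averaging the map recovers the score exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, W = 4, 7, 7
f = rng.normal(size=(K, H, W))        # last-conv features f_k(x, y)
w = rng.normal(size=(K,))             # FC weights w_k^c for one class c

gap = f.mean(axis=(1, 2))             # GAP: one scalar per channel, shape (K,)
S_c = w @ gap                         # class score

M_c = np.tensordot(w, f, axes=1)      # CAM: weighted channel sum, shape (H, W)
# GAP commutes with the FC layer: pooling the map gives back the score.
```

In practice you would then bilinearly up-sample `M_c` to the input resolution and overlay it on the image.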
Limitation: CAM only applies to the last conv layer (because the derivation requires the GAP-then-FC structure). Architectures without that head canβt use CAM directly.
Tier 4: Grad-CAM (Selvaraju et al. CVPR 2017)
Generalizes CAM to any layer, any architecture, by replacing the analytical FC weights with gradients.
Recipe:
- Pick any layer with activations $A^k$ (channel $k$, spatial indices $i, j$).
- Compute the gradient of the class score with respect to those activations, $\frac{\partial S_c}{\partial A^k}$.
- Global-average-pool the gradients to get per-channel weights: $\alpha_k^c = \frac{1}{Z} \sum_{i,j} \frac{\partial S_c}{\partial A_{ij}^k}$, where $Z$ is the number of spatial locations.
- Weighted combination of activations, then ReLU (only show evidence for, not against, the class): $L^c = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)$
The $\alpha_k^c$ encode "how important is channel $k$ for class $c$" via gradient magnitude, replacing CAM's analytic $w_k^c$. Grad-CAM reduces to standard CAM in the GAP-then-FC case.
Why ReLU? Without it, the map would also highlight regions that suppress the class, which is visually confusing. ReLU keeps only positive contributions.
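The recipe in numpy, with a random array standing in for the gradients that autograd would compute against the chosen layer (all shapes are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
K, H, W = 8, 7, 7
A     = rng.normal(size=(K, H, W))   # activations A^k of the chosen layer
grads = rng.normal(size=(K, H, W))   # dS_c / dA, from autograd in a real framework

alpha = grads.mean(axis=(1, 2))      # GAP the gradients -> alpha_k^c, shape (K,)

# Weighted channel combination, then ReLU to keep only positive evidence.
L_c = np.maximum(0.0, np.tensordot(alpha, A, axes=1))   # shape (H, W)
```

As with CAM, the resulting map is then up-sampled to input resolution for overlay; unlike CAM, nothing here assumed a GAP-then-FC head.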
Why ViTs are different
ViT doesn't have conv feature maps in the same way, but you can visualize attention weights from the [CLS] token (or pooled token) to all patches, which serves the same purpose: which patches contributed to the prediction. Often cleaner than Grad-CAM, because attention is itself a learned spatial weighting.
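A shape-level sketch of pulling the [CLS] row out of one head's attention matrix. The 14×14 patch grid (196 patch tokens plus one [CLS] token, i.e. 197 total) is an assumed ViT-Base-style configuration, and random values stand in for the model's real attention.

```python
import numpy as np

T, P = 197, 14                                  # tokens = 1 CLS + P*P patches
rng = np.random.default_rng(2)

# Fake softmax-style attention: nonnegative, rows sum to 1.
attn = np.abs(rng.normal(size=(T, T)))
attn /= attn.sum(axis=-1, keepdims=True)

# Row 0 is the CLS token's attention; drop the CLS->CLS entry, map to the grid.
cls_to_patches = attn[0, 1:].reshape(P, P)      # (14, 14) spatial heatmap
```

The reshaped map is directly overlayable on the image, one cell per patch, with no gradients required.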
Source
CS231n 2025 Lecture 9, slides 127–147 and 175–178 (first-layer filters, saliency via backprop, CAM derivation, Grad-CAM, intermediate features via guided backprop, ViT attention visualization). 2026 PDF not published; using the 2025 fallback (April 29, 2025).