Convolutional Neural Network

Feature Visualization

What does a CNN actually look at? A grab-bag of techniques for inspecting trained CNNs: where in the image the model attends, what each neuron responds to, which pixels swing the prediction.

Why care?

A trained classifier is a black box that outputs a class label. Feature visualization opens it up: it reveals failure modes (the network classifies "wolf" because of the snowy background, not the animal), validates that the model learned semantically meaningful features, and powers downstream tools like weakly-supervised localization (object localization for free, just from classification labels).

Tier 1: Looking at weights directly

First-layer filters

The first conv layer's filters have shape $C_{\text{out}} \times 3 \times k \times k$, so they are directly visualizable as small RGB images. For AlexNet ($64 \times 3 \times 11 \times 11$), ResNet-18/101 ($64 \times 3 \times 7 \times 7$), and DenseNet-121 ($64 \times 3 \times 7 \times 7$), the learned filters look almost identical: oriented edges, opposing colors, frequency-tuned blobs; strong evidence that low-level vision is a converged problem.

Doesn't generalize past the first layer: deeper conv weights operate in feature space (not pixel space), so direct visualization is uninformative.
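Extracting and normalizing first-layer filters for display takes only a few lines. A minimal PyTorch sketch, where the untrained `nn.Conv2d` is a stand-in; swap in a trained model's first conv layer to see real filters:

```python
import torch
import torch.nn as nn

# Stand-in first conv layer with random weights (shape: 64 x 3 x 7 x 7).
# In practice you would take e.g. a trained ResNet's conv1 here.
conv1 = nn.Conv2d(3, 64, kernel_size=7)

w = conv1.weight.detach()                     # (64, 3, 7, 7)
# Per-filter min-max normalization to [0, 1] so each filter is a valid RGB tile.
w_min = w.amin(dim=(1, 2, 3), keepdim=True)
w_max = w.amax(dim=(1, 2, 3), keepdim=True)
tiles = (w - w_min) / (w_max - w_min + 1e-8)  # 64 displayable 7x7 RGB images
```

From here the tiles can be arranged into a grid and plotted with any image library.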

Tier 2: Saliency via backprop (Simonyan, Vedaldi, Zisserman, ICLR Workshop 2014)

Question: which input pixels matter most for the predicted class?

Recipe:

  1. Forward pass to get the class score $S_c$ (use the unnormalized score, not the softmax probability: the softmax can be increased by suppressing other classes, so its gradients mix in evidence about competitors).
  2. Backprop to image pixels.
  3. Take absolute value, max over RGB channels → 2D saliency map.

The bright spots mark pixels that, if perturbed, would most change the score. Rough but cheap object localization falls out of a classifier trained only with image-level labels.
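The three-step recipe above, sketched in PyTorch. The tiny random-weight model is a stand-in for a trained classifier; shapes and names are illustrative:

```python
import torch
import torch.nn as nn

# Stand-in classifier; the recipe is identical for any trained model.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
model.eval()

img = torch.rand(1, 3, 32, 32, requires_grad=True)
scores = model(img)                 # 1. unnormalized class scores (no softmax)
cls = scores[0].argmax().item()
scores[0, cls].backward()           # 2. backprop the raw score to the pixels

# 3. |grad|, then max over the RGB channels -> one 2D saliency map
saliency = img.grad.abs().amax(dim=1)   # (1, 32, 32)
```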

Guided Backprop (Springenberg et al. ICLR Workshop 2015)

Standard backprop through a ReLU passes gradient only where the forward activation was positive. Guided backprop adds a second filter: also zero out gradients where the gradient itself is negative. So only positive gradients flowing through positive activations get through.

Forward (ReLU):          Backward (standard):       Backward (guided):
[ 1 -1  5]    [1 0 5]     [-2  0 -1]                 [0 0 0]
[ 2 -5 -7] →  [2 0 0]     [ 6  0  0]                 [6 0 0]
[-3  2  4]    [0 2 4]     [ 0 -1  3]                 [0 0 3]

Visually much cleaner: produces sharp, recognizable visualizations of what each intermediate neuron "looks for".
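One way to implement the extra gradient filter is a backward hook on each ReLU. A PyTorch sketch with a toy stand-in model; `register_full_backward_hook` is one of several ways to intercept the gradient:

```python
import torch
import torch.nn as nn

# Guided backprop: ReLU's own backward already zeroes gradients where the
# forward input was negative; the hook additionally zeroes negative gradients.
def guided_relu_hook(module, grad_in, grad_out):
    return (torch.clamp(grad_in[0], min=0.0),)

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
for m in model.modules():
    if isinstance(m, nn.ReLU):
        m.register_full_backward_hook(guided_relu_hook)

img = torch.rand(1, 3, 16, 16, requires_grad=True)
model(img)[0, 0].backward()
guided_grad = img.grad        # much sharper than the standard-backprop map
```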

Tier 3: Class Activation Mapping (CAM), Zhou et al. CVPR 2016

CAM only works on architectures that end with Global Average Pool → single FC → softmax (e.g. ResNet, GoogLeNet). The trick exploits that GAP commutes with the FC layer.

Setup. The last conv layer outputs features $A \in \mathbb{R}^{C \times H \times W}$ with channels $A_k$; GAP then the FC layer gives the class score:

$$S_c = \sum_k w_k^c \cdot \frac{1}{HW} \sum_{x,y} A_k(x,y) = \frac{1}{HW} \sum_{x,y} \sum_k w_k^c A_k(x,y)$$

The class activation map is the inner sum before spatial averaging:

$$M_c(x,y) = \sum_k w_k^c A_k(x,y)$$

Up-sample to image resolution and overlay as a heatmap → "for class $c$, here's where in the image the evidence came from". Discriminative localization with no localization labels.
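The derivation can be checked numerically. A minimal PyTorch sketch with random features and an illustrative FC layer; the final assertion verifies the GAP/FC commutation the trick relies on:

```python
import torch
import torch.nn as nn

# Illustrative shapes: C channels, HxW spatial grid, n_cls classes.
C, H, W, n_cls = 8, 7, 7, 10
A = torch.rand(C, H, W)               # last conv feature maps A_k
fc = nn.Linear(C, n_cls, bias=False)  # the single FC layer after GAP

# Score path: GAP, then FC.
F = A.mean(dim=(1, 2))                # F_k = (1/HW) sum_xy A_k(x,y)
scores = fc(F)                        # S_c = sum_k w_k^c F_k
c = scores.argmax().item()

# CAM: the same weighted sum, taken before spatial averaging.
cam = (fc.weight[c][:, None, None] * A).sum(dim=0)   # M_c(x,y), shape (H, W)

# Sanity check: spatially averaging the CAM recovers the class score.
assert torch.allclose(cam.mean(), scores[c])
```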

Limitation: CAM only applies to the last conv layer (because the derivation requires the GAP-then-FC structure). Architectures without that head can't use CAM directly.

Tier 4: Grad-CAM (Selvaraju et al. CVPR 2017)

Generalizes CAM to any layer, any architecture, by replacing the analytical FC weights with gradients.

Recipe:

  1. Pick any conv layer with activations $A \in \mathbb{R}^{C \times H \times W}$, channels $A^k$.
  2. Compute the gradients $\partial S_c / \partial A^k$ of the class score w.r.t. those activations.
  3. Global-average-pool the gradients to get per-channel weights: $\alpha_k^c = \frac{1}{HW} \sum_{i,j} \frac{\partial S_c}{\partial A^k_{ij}}$
  4. Weighted combination of activations, then ReLU (only show evidence for, not against, the class): $L^c = \mathrm{ReLU}\big(\sum_k \alpha_k^c A^k\big)$

The $\alpha_k^c$ encodes "how important is channel $k$ for class $c$" via gradient magnitude, replacing CAM's analytic $w_k^c$. Reduces to standard CAM in the GAP-then-FC case.

Why ReLU? Without it, the map would also highlight regions that suppress the class, which is visually confusing. ReLU keeps only positive contributions.
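The whole recipe in PyTorch, using a forward hook to grab the activations at an arbitrary layer and autograd for the per-channel weights. The random-weight model is a stand-in:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
target_layer = model[2]                 # any conv layer works

acts = {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(A=o))

img = torch.rand(1, 3, 32, 32)
scores = model(img)
c = scores[0].argmax().item()
A = acts["A"]                                      # (1, 16, 32, 32)
grads = torch.autograd.grad(scores[0, c], A)[0]    # dS_c / dA

alpha = grads.mean(dim=(2, 3), keepdim=True)       # GAP over spatial dims
cam = torch.relu((alpha * A).sum(dim=1))           # (1, 32, 32)
# Up-sample cam to the input resolution and overlay as a heatmap.
```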

Why ViTs are different

ViT doesn't have conv feature maps in the same way, but you can visualize the attention weights from the [CLS] token (or pooled token) to all patches, which serves the same purpose: which patches contributed to the prediction. Often cleaner than Grad-CAM because attention is itself a learned spatial weighting.
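Reading off the [CLS]-to-patch row of an attention map, sketched with a single `nn.MultiheadAttention` layer standing in for one ViT block (dimensions are illustrative, e.g. a 14×14 patch grid):

```python
import torch
import torch.nn as nn

d, n_patches = 32, 196                 # toy embed dim; 14x14 patch grid
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

# Token sequence as in ViT: [CLS] followed by the patch embeddings.
tokens = torch.rand(1, 1 + n_patches, d)
_, weights = attn(tokens, tokens, tokens,
                  need_weights=True, average_attn_weights=True)

cls_to_patches = weights[0, 0, 1:]         # attention from [CLS] to each patch
heatmap = cls_to_patches.reshape(14, 14)   # overlay on the image's patch grid
```

In a real ViT you would hook the attention module of a chosen block (often the last) and average or select over heads.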

Source

CS231n 2025 Lec 9 slides 127–147, 175–178 (first layer filters, saliency via backprop, CAM derivation, Grad-CAM, intermediate features via guided backprop, ViT attention visualization). 2026 PDF not published; using the 2025 fallback (April 29, 2025).