Neural Network

Convolutional Neural Network (CNN)

A CNN is a neural network that uses spatially-shared convolutional filters instead of dense weight matrices, making it well-suited to image data.

Why CNN instead of an MLP for images?

With a vanilla MLP on raw pixels, the input dimensionality explodes (a 32×32×3 image is already a 3072-vector) and the dense weight matrices become huge. A CNN learns small filter weights that slide across the image, so parameters don't scale with image size.
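
A quick parameter count makes this concrete (a minimal sketch; the layer widths are my own illustrative choices):

```python
import torch.nn as nn

mlp = nn.Linear(32 * 32 * 3, 100)       # dense layer on a flattened 32x32x3 image
conv = nn.Conv2d(3, 10, kernel_size=5)  # ten 5x5 filters, works at any image size

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(mlp), count(conv))  # 307300 vs 760
```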

CNNs are central to Image Classification and Object Detection. When I'm ready to implement a CNN from scratch, watch this ConvNet in Practice lecture.

At the end, features are flattened and passed through an MLP for classification.

Three types of layers in a ConvNet architecture (minimal sketch after this list):

  • Convolution (CONV) layer: much of the design work is selecting hyperparameters (padding, strides, number of filters, kernel size)
  • Pooling (POOL) layer
  • Fully Connected (FC) layer
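
A minimal sketch of how the three layer types stack in PyTorch (the layer sizes here are my own illustrative choices, not from the lecture):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),  # CONV: 32 filters, "same" padding
    nn.ReLU(),
    nn.MaxPool2d(2),                             # POOL: 32x32 -> 16x16
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # POOL: 16x16 -> 8x8
    nn.Flatten(),                                # flatten features for the MLP head
    nn.Linear(64 * 8 * 8, 10),                   # FC: 10-class scores
)

x = torch.randn(1, 3, 32, 32)
print(model(x).shape)  # torch.Size([1, 10])
```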

CNN Architectures

  • Classic Networks
    • LeNet-5 (1998)
    • AlexNet
      • First use of ReLU
      • Used LRN layers (not common anymore)
      • Heavy data augmentation
      • Dropout 0.5
      • Batch size 128
      • SGD Momentum 0.9
      • Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
      • L2 weight decay 5e-4
      • 7-CNN ensemble: 18.2% → 15.4% top-5 error
    • VGG
  • ResNet
  • Inception
  • U-Net

Advantages:

  • Parameter sharing: a feature detector (e.g. a vertical edge detector) useful in one part of the image is probably useful elsewhere too
  • Sparsity of connections: each output value depends on only a small number of inputs

Why CNNs over MLPs for images (CS231n Lec 5)

Stretching a 32×32×3 image into a 3072-vector for an MLP destroys the spatial structure: pixels that were neighbors are no longer treated as neighbors. The conv layer instead preserves the H×W×C volume, sliding small filters across spatial locations.

Intuition

Convolution is weight sharing across positions. A filter that detects “vertical edge” in the top-left also detects it in the bottom-right, so translation equivariance is baked into the architecture instead of having to be learned from scratch. Pooling adds a bit of scale invariance on top, and stacking convs lets later layers build hierarchical features (edges → textures → parts → objects).

Each filter is Cin × KH × KW and always extends the full input depth. For a 32×32×3 input, a 5×5 filter is really 5×5×3. One filter slid across the input produces one activation map of shape 1 × H' × W'. Stacking Cout filters gives an output volume of Cout × H' × W'. The full layer params: weights Cout × Cin × KH × KW plus a Cout-dim bias vector.
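
A quick shape check of the weights and bias in PyTorch (a sketch matching the 32×32×3 example):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5)
print(conv.weight.shape)  # torch.Size([10, 3, 5, 5]) = Cout x Cin x KH x KW
print(conv.bias.shape)    # torch.Size([10])          = Cout
```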

A ConvNet is just (Conv → ReLU → Conv → ReLU → … → optional Pool → … → FC) end-to-end with backprop.

Convolution layer: shapes and hyperparameters

| name | symbol | typical |
| --- | --- | --- |
| input channels | Cin | 3, 32, 64, … |
| output channels (# filters) | Cout | 32, 64, 128, 256 (powers of 2) |
| kernel size (usually KH = KW = K) | K | 1, 3, 5 |
| padding for “same” | P | (K − 1)/2 |
| stride | S | 1 (preserve), 2 (downsample) |

Output spatial size: W′ = (W − K + 2P) / S + 1 (and the same for H′).

Read this as: the filter needs K − 1 pixels of room to sit, padding gives it 2P extra room at the edges, and stride S says how many pixels you hop each time. Without padding, corners get visited less often than the middle; “same” padding keeps every input pixel equally weighted.
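
The formula as a one-line helper (my own sketch; integer division matches how PyTorch floors sizes):

```python
def conv_out_size(w: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output width for input width w, kernel k, padding p, stride s."""
    return (w - k + 2 * p) // s + 1

print(conv_out_size(32, 5, p=2))       # 32 ("same")
print(conv_out_size(32, 3, p=1, s=2))  # 16 (downsample by 2)
```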

Common settings: 3×3 conv with P = 1 (same), 5×5 with P = 2 (same), 3×3 with S = 2, P = 1 (downsample by 2), 1×1 with P = 0.

Worked example. Input 3×32×32, ten 5×5 filters with stride 1, pad 2:

  • Output: 10×32×32 (since (32 − 5 + 2·2)/1 + 1 = 32)
  • Params: 10 × (5·5·3 + 1) = 750 weights + 10 biases = 760
  • Multiply-adds: 10·32·32 = 10,240 outputs × 5·5·3 = 75 ops each ≈ 768K
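
Verifying the worked example (a sketch):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2)
x = torch.randn(1, 3, 32, 32)

print(conv(x).shape)                              # torch.Size([1, 10, 32, 32])
print(sum(p.numel() for p in conv.parameters()))  # 760
```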

1×1 convolutions

A 1×1 conv with Cout filters and input Cin × H × W produces Cout × H × W. Each output spatial cell is a Cin-dim dot product with the input cell at the same location. It’s effectively a per-pixel fully-connected layer across channels. Used everywhere as a cheap channel-mixer or channel-reducer (see ResNet bottlenecks).

Think of 1×1 convs as “rotating” the channel axis: no spatial mixing happens, they just relearn what the feature basis is at every pixel.
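
The per-pixel-FC view, checked numerically (a sketch: an nn.Linear given the same weights reproduces a 1×1 conv exactly):

```python
import torch
import torch.nn as nn

cin, cout = 64, 32
conv = nn.Conv2d(cin, cout, kernel_size=1)

fc = nn.Linear(cin, cout)                           # per-pixel FC across channels
fc.weight.data = conv.weight.data.view(cout, cin)   # copy the 1x1 conv weights
fc.bias.data = conv.bias.data

x = torch.randn(1, cin, 8, 8)
y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # channels-last, apply, back
print(torch.allclose(conv(x), y_fc, atol=1e-6))       # True
```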

Receptive fields

The receptive field of an output cell is the region of the input it depends on. For one conv, it’s K × K. Stacking L conv layers (stride 1) grows the receptive field linearly: RF = 1 + L · (K − 1).

Receptive field is how far “back” one neuron sees. If a deep-layer neuron has to classify a dog, it needs enough receptive field to actually see the dog; if not, it’s trying to make a global decision from a peephole. That’s why downsampling exists: each pool doubles the receptive field without adding layers.

Receptive field "in the input" vs "in the previous layer"

These are different things. The formula above gives the receptive field in the original input; relative to the previous layer, a conv output sees only a K × K window.
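
A small helper that accumulates the receptive field through a stack of (kernel, stride) layers (my own sketch of the standard recurrence; the tap spacing grows with the product of earlier strides):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride), first layer first."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # taps are `jump` input pixels apart
        jump *= s
    return rf

print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7: three 3x3 ~ one 7x7
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: downsampling grows RF faster
```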

Problem with conv-only stacks: large images (224×224) need many layers before any output can “see” the whole image. Solution: downsample inside the network, via either strided conv or pooling.

Translation equivariance

Convolution and pooling commute with spatial translation: conv(shift(x)) = shift(conv(x)).

Intuition: a feature (edge, eye, wheel) doesn’t change identity when it moves around the image, so the same filter should fire, just at a translated location. This is the inductive bias that makes CNNs sample-efficient on images.

Note: equivariance (output translates with input) is a different property from invariance (output doesn’t change). Pooling adds a small amount of local invariance on top of equivariance.
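
A numeric check of equivariance (a sketch; circular padding plus torch.roll keeps the identity exact up to float error, since ordinary zero padding breaks it at the borders):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode="circular")
x = torch.randn(1, 3, 16, 16)
shift = lambda t: torch.roll(t, shifts=(2, 5), dims=(2, 3))  # translate H and W

print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-5))  # True
```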

What conv filters learn

  • Linear classifier (1 layer): one whole-image template per class
  • MLP: bank of whole-image templates (e.g. 64 hidden units = 64 templates)
  • First conv layer: small local templates, typically oriented edges and opposing colors. AlexNet’s first layer is 64 filters of size 3×11×11, and the visualization looks like a Gabor bank
  • Deeper conv layers: harder to visualize directly, but tend to respond to larger structures (eyes, letters, textures). See Feature Visualization

1D / 2D / 3D convolutions

Same idea, different spatial dimensionality.

| variant | input | weights |
| --- | --- | --- |
| 1D conv (audio, sequences) | Cin × W | Cout × Cin × K |
| 2D conv (images) | Cin × H × W | Cout × Cin × K × K |
| 3D conv (video, volumetric) | Cin × D × H × W | Cout × Cin × K × K × K |

See 3D CNN for video-specific use.
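
Weight shapes across the three variants (a sketch):

```python
import torch.nn as nn

print(nn.Conv1d(3, 8, kernel_size=3).weight.shape)  # torch.Size([8, 3, 3])
print(nn.Conv2d(3, 8, kernel_size=3).weight.shape)  # torch.Size([8, 3, 3, 3])
print(nn.Conv3d(3, 8, kernel_size=3).weight.shape)  # torch.Size([8, 3, 3, 3, 3])
```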

PyTorch API

```python
torch.nn.Conv2d(in_channels, out_channels, kernel_size,
                stride=1, padding=0, dilation=1, groups=1, bias=True)
```

dilation spaces out the kernel taps (à trous). groups partitions input channels into independent conv groups (groups=in_channels gives depthwise conv).
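
Depthwise conv via groups, with a shape check (a sketch): each of the 32 groups sees a single input channel, so the weight tensor’s second dim is 1:

```python
import torch.nn as nn

depthwise = nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32)
print(depthwise.weight.shape)  # torch.Size([32, 1, 3, 3]): one 3x3 filter per channel
```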

From CS231n Lec 5 slides 42-106 (conv layer mechanics, filter dimensions, activation maps, output shape, padding, stride, receptive fields, 1×1 conv, translation equivariance, pooling, what filters learn, 1D/2D/3D, PyTorch API).

ImageNet architectures: a brief tour

From CS231n Lec 6, top-5 error on ILSVRC year by year (every drop was a paradigm shift):

| Year | Model | Top-5 err | Layers | Key idea |
| --- | --- | --- | --- | --- |
| 2010 | shallow | 28.2% | – | hand-crafted features |
| 2012 | AlexNet | 16.4% | 8 | first deep ConvNet to win (ReLU, dropout, GPU training) |
| 2013 | ZFNet | 11.7% | 8 | tweaked AlexNet hyperparameters |
| 2014 | VGG | 7.3% | 19 | small filters, deeper |
| 2014 | GoogLeNet | 6.7% | 22 | Inception modules |
| 2015 | ResNet | 3.57% | 152 | residual blocks, “revolution of depth” |
| 2017 | SENet | 2.3% | – | squeeze-and-excitation |

VGG (Simonyan & Zisserman 2014)

Design rule: only 3×3 conv (stride 1, pad 1) + 2×2 max pool (stride 2). No fancy shapes, just stack them.

Why all 3×3 instead of bigger filters? Three stacked 3×3 convs have the same effective receptive field as one 7×7 conv (1 + 3·(3-1) = 7), but with:

  • More nonlinearities (3 ReLUs instead of 1, richer feature transforms)
  • Fewer parameters. For C channels per layer: 3 · (3·3·C·C) = 27C² vs 7·7·C·C = 49C² for the single 7×7

This is the general “deep + small filters > shallow + big filters” lesson that everything since has followed.
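
Checking the parameter arithmetic (a sketch; C = 256 is just an example width):

```python
C = 256
three_3x3 = 3 * (3 * 3 * C * C)  # 27 C^2 = 1,769,472
one_7x7 = 7 * 7 * C * C          # 49 C^2 = 3,211,264
print(three_3x3, one_7x7)        # same 7x7 receptive field, ~45% fewer params
```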

AlexNet (Krizhevsky et al. 2012)

The model that started modern deep learning for vision. Notable decisions:

  • First mainstream use of ReLU
  • Heavy data augmentation, dropout 0.5
  • SGD + momentum 0.9, LR 1e-2 (manual decay by 10× when val plateaus), L2 weight decay 5e-4
  • Trained on two GTX 580 GPUs (the now-historical channel split was a memory hack)
  • 7-CNN ensemble: 18.2% → 15.4% top-5 error

From CS231n Lec 6 slides 33-45 (ImageNet history chart, VGG case study with effective receptive field) and 46-58 (ResNet, see paper note).

Some Ideas

“Flattening” out the pooling layer.

ConvNet in Practice

Really helpful video.

Data Augmentation