Residual Network (ResNet)
Definition (from the Run:ai guide):
Residual Network (ResNet) is a Convolutional Neural Network (CNN) architecture that overcame the "vanishing gradient" problem, making it possible to construct networks with up to thousands of convolutional layers, which outperform shallower networks.
https://www.run.ai/guides/deep-learning-for-computer-vision/pytorch-resnet
Original paper: "Deep Residual Learning for Image Recognition" (He et al., 2015).
Deeper models are harder to optimize because gradients degrade as they propagate back through many layers. The idea: make it easy for extra layers to copy what the earlier layers already learned, by passing the input through unchanged via identity connections. This eases gradient propagation, speeds up learning, and works remarkably well.
The motivating observation (CS231n Lec 6)
What happens when you stack more layers on a “plain” (non-residual) ConvNet? Empirically, the 56-layer net performs worse than the 20-layer net on both train and test error — so the problem is not overfitting. Deep models have strictly more representation power than shallow models (they can copy the shallow model and set extra layers to identity), so this must be an optimization failure: deeper plain nets are harder to optimize.
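The "deeper nets can copy shallower nets" step can be checked concretely: a conv layer whose kernel is a Dirac delta is exactly the identity mapping, so in principle the extra layers of a deeper plain net could all be set to identity. A quick sketch using PyTorch's `nn.init.dirac_` (my choice of demo, not from the lecture):

```python
import torch
import torch.nn as nn

# A conv layer whose kernel is a Dirac delta (1 at the center of its own
# channel, 0 elsewhere) passes its input through unchanged.
conv = nn.Conv2d(8, 8, kernel_size=3, padding=1, bias=False)
nn.init.dirac_(conv.weight)

x = torch.randn(2, 8, 16, 16)
with torch.no_grad():
    y = conv(x)
print(torch.allclose(x, y))  # True: stacking such layers changes nothing
```

So representation power is not the bottleneck; SGD simply fails to find such solutions in deep plain nets.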
The fix: residual blocks
Instead of asking a stack of layers to fit the desired mapping H(x) directly, have it fit the residual F(x) = H(x) − x and add the input back via a skip connection:
If the optimal mapping is close to identity (the failure case for plain nets, which cannot even learn to copy a shallower net), the layers only need to push F(x) toward zero, which is easy. Identity comes for free; only the deviation from identity needs to be learned.
```
H(x) = F(x) + x
       ↑
      relu
       ↑
       ⊕ ←─── x (identity skip)
       ↑
   3×3 conv
       ↑
      relu
       ↑
   3×3 conv
       ↑
       x
```
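The diagram above maps directly onto a few lines of PyTorch. A minimal sketch of the basic block (same-size path only, no downsampling; layer names are mine). The published blocks also put BatchNorm after each conv, omitted here to stay close to the diagram:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Basic residual block: two 3x3 convs, identity skip, relu after the add."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.conv1(x))  # 3x3 conv -> relu
        out = self.conv2(out)           # 3x3 conv; out is F(x)
        return self.relu(out + x)       # add the identity skip, then relu

block = BasicBlock(64)
y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56]): shape is preserved
```

Because the skip is a plain addition, input and output shapes must match; blocks that change resolution or channel count need a projection on the skip path.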
Full architecture
- Stack residual blocks; each block has two 3×3 conv layers (basic block) — bottleneck blocks add 1×1 reductions for deeper variants.
- Periodically double the number of filters and downsample spatially with stride 2 (halving each spatial dim, doubling channel count → activation volume halves).
- A 7×7 stride-2 conv stem at the input.
- Global average pool → single FC → softmax at the output.
ImageNet variants: ResNet-18, -34, -50, -101, -152.
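The downsampling arithmetic above can be verified with a short shape walk through the ResNet-18/34 stage layout (stage widths 64 to 512 are from the paper; the stem is the 7×7 stride-2 conv followed by a 3×3 stride-2 max pool):

```python
# Walk a 224x224 ImageNet input through the ResNet-18/34 stages to see how
# each "stride-2 downsample + double the filters" step halves the
# activation volume.

h, w = 224 // 4, 224 // 4         # stem: two stride-2 ops -> 56x56
stage_channels = [64, 128, 256, 512]

volumes = []
for i, c in enumerate(stage_channels):
    if i > 0:                     # every stage after the first downsamples by 2
        h, w = h // 2, w // 2
    volumes.append(h * w * c)
    print(f"stage {i + 1}: {h}x{w}x{c} = {h * w * c} activations")

# The last stage ends at 7x7x512; global average pooling then reduces it to
# a 512-dim vector before the single FC layer.
```

Each stage's activation volume is exactly half the previous one (200704, 100352, 50176, 25088), which keeps per-stage compute roughly balanced.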
Results
- ResNet-152 won ILSVRC’15 with 3.57% top-5 error (vs VGG’s 7.3%, GoogLeNet’s 6.7%).
- Swept all classification + detection competitions in ILSVRC’15 and COCO’15.
- 8× deeper than VGG yet lower complexity in FLOPs (thanks to bottlenecks + global avg pool removing big FC layers).
Source
CS231n Lec 6 slides 46–58 (plain-net optimization failure, residual block, full architecture, ImageNet results).