Residual Network (ResNet)
Definition (from the Run:ai guide):
Residual Network (ResNet) is a Convolutional Neural Network (CNN) architecture that overcame the "vanishing gradient" problem, making it possible to construct networks with up to thousands of convolutional layers, which outperform shallower networks.
https://www.run.ai/guides/deep-learning-for-computer-vision/pytorch-resnet
Original paper: "Deep Residual Learning for Image Recognition" (He et al., 2015).
Deeper models are harder to optimize because gradients degrade as they propagate back through many layers. The idea: make it easy for extra layers to copy what the earlier layers already learned, by passing the input through unchanged via identity connections. This eases gradient propagation, speeds up learning, and works remarkably well.
The motivating observation (CS231n Lec 6)
What happens when you stack more layers on a “plain” (non-residual) ConvNet? Empirically, the 56-layer net performs worse than the 20-layer net on both train and test error — so the problem is not overfitting. Deep models have strictly more representation power than shallow models (they can copy the shallow model and set extra layers to identity), so this must be an optimization failure: deeper plain nets are harder to optimize.
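The "deeper nets can copy shallower nets" step can be checked concretely: a conv layer whose kernel is a Dirac delta is exactly the identity mapping, so in principle the extra layers of a deeper plain net could all be set to identity. A quick sketch using PyTorch's `nn.init.dirac_` (my choice of demo, not from the lecture):

```python
import torch
import torch.nn as nn

# A conv layer whose kernel is a Dirac delta (1 at the center of its own
# channel, 0 elsewhere) passes its input through unchanged.
conv = nn.Conv2d(8, 8, kernel_size=3, padding=1, bias=False)
nn.init.dirac_(conv.weight)

x = torch.randn(2, 8, 16, 16)
with torch.no_grad():
    y = conv(x)
print(torch.allclose(x, y))  # True: stacking such layers changes nothing
```

So representation power is not the bottleneck; SGD simply fails to find such solutions in deep plain nets.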
The fix: residual blocks
Instead of asking a stack of layers to fit the desired mapping H(x) directly, have it fit the residual F(x) = H(x) − x and add the input back via a skip connection:
If the optimal mapping is close to identity (the failure case for plain nets, which cannot even learn to copy a shallower net), the layers only need to push F(x) toward zero, which is easy. Identity comes for free; only the deviation from identity needs to be learned.
```
H(x) = F(x) + x
       ↑
      relu
       ↑
       ⊕ ←─── x (identity skip)
       ↑
   3×3 conv
       ↑
      relu
       ↑
   3×3 conv
       ↑
       x
```
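The diagram above maps directly onto a few lines of PyTorch. A minimal sketch of the basic block (same-size path only, no downsampling; layer names are mine). The published blocks also put BatchNorm after each conv, omitted here to stay close to the diagram:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Basic residual block: two 3x3 convs, identity skip, relu after the add."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.conv1(x))  # 3x3 conv -> relu
        out = self.conv2(out)           # 3x3 conv; out is F(x)
        return self.relu(out + x)       # add the identity skip, then relu

block = BasicBlock(64)
y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56]): shape is preserved
```

Because the skip is a plain addition, input and output shapes must match; blocks that change resolution or channel count need a projection on the skip path.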
Full architecture
- Stack residual blocks; each block has two 3×3 conv layers (basic block) — bottleneck blocks add 1×1 reductions for deeper variants.
- Periodically double the number of filters and downsample spatially with stride 2 (halving each spatial dim, doubling channel count → activation volume halves).
- A 7×7 stride-2 conv stem at the input.
- Global average pool → single FC → softmax at the output.
ImageNet variants: ResNet-18, -34, -50, -101, -152.
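The downsampling arithmetic above can be verified with a short shape walk through the ResNet-18/34 stage layout (stage widths 64 to 512 are from the paper; the stem is the 7×7 stride-2 conv followed by a 3×3 stride-2 max pool):

```python
# Walk a 224x224 ImageNet input through the ResNet-18/34 stages to see how
# each "stride-2 downsample + double the filters" step halves the
# activation volume.

h, w = 224 // 4, 224 // 4         # stem: two stride-2 ops -> 56x56
stage_channels = [64, 128, 256, 512]

volumes = []
for i, c in enumerate(stage_channels):
    if i > 0:                     # every stage after the first downsamples by 2
        h, w = h // 2, w // 2
    volumes.append(h * w * c)
    print(f"stage {i + 1}: {h}x{w}x{c} = {h * w * c} activations")

# The last stage ends at 7x7x512; global average pooling then reduces it to
# a 512-dim vector before the single FC layer.
```

Each stage's activation volume is exactly half the previous one (200704, 100352, 50176, 25088), which keeps per-stage compute roughly balanced.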
Results
- ResNet-152 won ILSVRC’15 with 3.57% top-5 error (vs VGG’s 7.3%, GoogLeNet’s 6.7%).
- Swept all classification + detection competitions in ILSVRC’15 and COCO’15.
- 8× deeper than VGG yet lower complexity in FLOPs (thanks to bottlenecks + global avg pool removing big FC layers).
Source
CS231n Lec 6 slides 46–58 (plain-net optimization failure, residual block, full architecture, ImageNet results).