Kaiming Initialization

Kaiming initialization is a method used to set the initial weights of a neural network layer to help the network converge faster and more stably, especially for deep networks.

Kaiming initialization (He initialization) is still the default go-to method for initializing layers when you’re using ReLU or Leaky ReLU activations — even in very modern architectures.
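
In PyTorch, for instance, you can apply it explicitly with torch.nn.init.kaiming_normal_. A minimal sketch (the 4096-wide Linear layer is an arbitrary choice for illustration):

import torch.nn as nn

layer = nn.Linear(4096, 4096)
# fan_in mode preserves forward-pass variance; nonlinearity='relu' supplies the factor of 2
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)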

Why initialization matters (CS231n Lec 6)

The size of the initial weights controls whether activations stay alive through the depth of a network. Two failure modes for a 6-layer FC net of width 4096 with ReLU:

Too small: W = 0.01 * randn(Din, Dout). Activations shrink toward zero as you go deeper:

Layer 1: mean=0.26, std=0.37
Layer 2: mean=0.12, std=0.17
Layer 3: mean=0.05, std=0.08
...
Layer 6: mean=0.00, std=0.01

Backward gradients vanish similarly — no learning happens in deep layers.

Too large: W = 0.05 * randn(Din, Dout). Activations blow up:

Layer 1: mean=1.27, std=1.86
Layer 2: mean=2.89, std=4.25
...
Layer 6: mean=74.50, std=109.24

Gradients explode; numerical instability.
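
A quick way to see both regimes is to run the experiment yourself. A minimal sketch (forward_stats, the batch size, and the seed are my choices, not the lecture's, so the exact numbers will differ slightly from the slides):

import numpy as np

def forward_stats(scale, depth=6, width=4096, batch=16, seed=0):
    # Push a unit-Gaussian batch through `depth` ReLU layers whose weights
    # are drawn as scale * randn, printing per-layer activation statistics.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    for layer in range(1, depth + 1):
        W = rng.standard_normal((width, width)) * scale
        x = np.maximum(0, x @ W)  # ReLU
        print(f"Layer {layer}: mean={x.mean():.2f}, std={x.std():.2f}")

forward_stats(0.01)  # too small: activations shrink toward zero
forward_stats(0.05)  # too large: activations blow up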

The fix: scale the random weights by √(2/Din), i.e. by the layer's fan-in with ReLU's factor of 2:

W = np.random.randn(Din, Dout) * np.sqrt(2 / Din)

For ReLU networks, this keeps the pre-activations roughly unit-Gaussian at every layer regardless of depth, so the post-ReLU statistics stay constant too. Verified empirically: layers 1–6 all show mean ≈ 0.55, std ≈ 0.81. Source: He et al., “Delving Deep into Rectifiers”, ICCV 2015.
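
Plugging the same scale into the forward_stats sketch above shows the stable regime:

forward_stats(np.sqrt(2 / 4096))  # Kaiming scale for width 4096: mean/std stay roughly flat with depth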

The factor of 2 corrects for ReLU killing half the activations on average: for zero-mean, symmetric pre-activations, ReLU halves the variance, so doubling the weight variance restores it. For tanh / sigmoid, which don't zero out half their inputs, use Xavier/Glorot init instead.
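
In the lecture's numpy style, the Xavier version simply drops the factor of 2 (Din and Dout here are placeholder sizes; this is the fan-in form CS231n uses, while Glorot's original paper averages fan-in and fan-out):

import numpy as np

Din, Dout = 4096, 4096
W = np.random.randn(Din, Dout) / np.sqrt(Din)  # Xavier/Glorot, fan-in form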

Source

CS231n Lec 6 slides 61–66 (init too small / too large failure modes, fan-in scaling, Kaiming/MSRA init for ReLU).