Dropout
Dropout is a regularization technique that randomly zeroes activations during training so no single neuron can be relied on.
Why would randomly breaking the network help?
It forces redundant representations and prevents co-adaptation (neuron A can’t rely on neuron B always being on). Another view: dropout trains a huge ensemble of weight-sharing subnetworks, one per sampled mask.
- When an activation is dropped, its output is 0 and so is its gradient in backprop, so no update flows through it to its incoming weights (see the sketch after this list)
- I was asked about this in my Intact interview
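A minimal sketch of that forward/backward behaviour, assuming a train-time layer that caches its mask and reuses it in backprop (the function names are illustrative, not from any library):

```python
import numpy as np

def dropout_forward(h, p, rng):
    # Drop each activation with probability p; cache the mask for the backward pass
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask, mask

def dropout_backward(dout, mask):
    # Dropped units pass no gradient: the same mask zeroes dout,
    # so their incoming weights get no update from this step
    return dout * mask
```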
Train / Test
From the CS231n Lec 6 slides:
- Train: at each forward pass, sample a random binary mask and zero out activations with probability $p$; a common setting is $p = 0.5$
- Test: don’t drop, instead scale all activations by $1 - p$ so the expected output matches training (see the sketch below)
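A minimal numpy sketch of the two phases, assuming a layer's activations `h` and the drop-probability convention above (names are illustrative):

```python
import numpy as np

p = 0.5  # probability of dropping an activation

def dropout_train(h, p, rng=np.random.default_rng()):
    # Sample a binary mask: keep with probability 1 - p, drop with probability p
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask

def dropout_test(h, p):
    # No dropping at test time; scale so the expected magnitude matches training
    return h * (1 - p)
```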
This is the general regularization recipe: at train time $y = f_W(x, z)$ (random $z$), and at test we want
$$y = \mathbb{E}_z\big[f_W(x, z)\big] = \int p(z)\, f_W(x, z)\, dz$$
where:
- $z$ is the random binary mask
- $p(z)$ is its distribution
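As a one-unit worked instance of that expectation (assuming drop probability $p$, a pre-dropout activation $x$, and mask $z \sim \mathrm{Bernoulli}(1-p)$):

$$\mathbb{E}_z[z\,x] = (1-p)\cdot x + p\cdot 0 = (1-p)\,x$$

For a single linear unit the test-time scaling reproduces this expectation exactly; for a full nonlinear network it is only an approximation.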
Intuition
At train time, each neuron fires with probability $1 - p$, so its downstream expected contribution is $(1 - p)\,x$. At test time nothing drops, so activations are larger on average by a factor of $\frac{1}{1 - p}$. Scaling by $1 - p$ at test time (or equivalently scaling by $\frac{1}{1 - p}$ at train time, “inverted dropout”) lines the two up so the network sees the same expected input distribution in both phases.
Dropout’s “scale by $1 - p$” is the cheap deterministic approximation to this integral.
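A minimal inverted-dropout sketch along these lines, again assuming drop probability `p` (names are illustrative):

```python
import numpy as np

p = 0.5  # drop probability

def inverted_dropout_train(h, p, rng=np.random.default_rng()):
    # Keep with probability 1 - p, then divide by 1 - p so the expected
    # activation already matches the test-time forward pass
    mask = (rng.random(h.shape) >= p).astype(h.dtype) / (1 - p)
    return h * mask

def inverted_dropout_test(h):
    # Test time is a plain forward pass: no mask, no scaling
    return h
```

Pushing the scaling into training is what most modern implementations do, since it leaves the test-time path untouched.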
Ensemble Interpretation
A net with $n$ activations has $2^n$ possible binary masks. For a 4096-wide layer that’s $2^{4096} \approx 10^{1233}$ subnetworks sharing weights. Each forward pass trains one of them on one minibatch; test-time scaling averages the ensemble.
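A quick numpy check of this averaging claim, under the simplifying assumption of a single linear readout `w` on top of a 4096-wide activation `x`:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
x = rng.standard_normal(4096)  # one layer's activations
w = rng.standard_normal(4096)  # downstream linear readout

# Monte Carlo average over many sampled subnetworks (one mask = one subnetwork)
outs = [w @ (x * (rng.random(4096) >= p)) for _ in range(10_000)]
mc_average = np.mean(outs)

# Deterministic test-time approximation: scale activations by 1 - p
scaled = w @ (x * (1 - p))

print(mc_average, scaled)  # should agree closely
```

For a linear readout the scaled output equals the ensemble mean exactly in expectation; with nonlinearities in between, the scaling is only an approximation of the true ensemble average.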
Intuition
You can’t rely on any specific teammate showing up, so every neuron has to be individually useful. No “this feature works only if that other feature co-activates” hacks, because at any given step the other feature might be dropped. It’s ensemble learning disguised as a regularizer: you train an exponential number of subnetworks with shared weights, and at test time you implicitly average them by scaling.