Dropout
This refers to "dropping out" nodes, i.e. randomly deactivating them during training (setting their activations to zero), so they have no effect on the rest of the network for that forward pass.
- I was asked about this in my Intact interview
How could this possibly be a good idea? It forces the network to learn a redundant representation, which helps prevent overfitting. Another interpretation is that dropout trains a large ensemble of models that share parameters: each binary mask is one model, and each gets trained on only ~1 datapoint.
When a node is dropped out, its activation is 0, so in backprop the gradient flowing through it is also 0 and its incoming weights don't get updated on that pass.
During test time, we use all the nodes but scale the activations by the keep probability p, which approximates averaging over the ensemble of masks (with "inverted dropout" we instead scale by 1/p during training, so test time needs no change).
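A minimal NumPy sketch of inverted dropout (the keep probability p and layer sizes are assumed values for illustration): the mask zeros activations during training and the same mask zeros the gradients in backprop, while test time just passes activations through unchanged.

```python
import numpy as np

p = 0.5  # keep probability (assumed value for illustration)

def dropout_forward_train(h, p):
    """Inverted dropout: zero units with prob (1 - p), scale survivors by 1/p."""
    mask = (np.random.rand(*h.shape) < p) / p
    return h * mask, mask

def dropout_backward(dout, mask):
    """Gradient is zero wherever the unit was dropped."""
    return dout * mask

def dropout_forward_test(h):
    """Test time: no masking, no scaling (already handled by 1/p in training)."""
    return h

# toy usage
h = np.maximum(0, np.random.randn(4, 10))    # some ReLU hidden-layer activations
h_drop, mask = dropout_forward_train(h, p)
dh = dropout_backward(np.ones_like(h_drop), mask)
```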
Gradient Checking → See stanford notes
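As a reminder of what those notes cover, here is a minimal centered-difference gradient check sketch (the toy function f, step size h, and relative-error formula are illustrative assumptions, not the notes' exact code):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered differences: df/dx_i ≈ (f(x + h*e_i) - f(x - h*e_i)) / (2h)."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + h
        fp = f(x)
        x[i] = old - h
        fm = f(x)
        x[i] = old          # restore original value
        grad[i] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

def rel_error(a, b):
    """Relative error between numerical and analytic gradients."""
    return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))

# toy check: f(x) = sum(x**2), analytic gradient is 2x
x = np.random.randn(3, 4)
num_grad = numerical_gradient(lambda x: np.sum(x ** 2), x)
print(rel_error(num_grad, 2 * x))   # should be tiny, ~1e-10 or less
```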
2. Define the input layer and first hidden layer, then add Dropout regularization to help prevent overfitting (see the sketch below).
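A minimal Keras sketch of that step (the input dimension, layer width, and dropout rate are assumed values for illustration):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(784,)),           # input layer (assumed 784 features)
    layers.Dense(128, activation='relu'), # first hidden layer (assumed width)
    layers.Dropout(0.5),                  # drop 50% of units during training
])
```

Note that Keras's Dropout argument is the fraction of units to *drop* (not keep), and it is only active during training; at inference the layer is a no-op.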