Dropout

This refers to randomly “dropping out” nodes, i.e. deactivating them during training, so they have no effect on the rest of the network for that forward pass.

  • I was asked about this in my Intact interview

How could this possibly be a good idea? It forces the network to learn a redundant representation, which helps prevent overfitting. Another interpretation is that dropout trains a large ensemble of models that share parameters: each binary mask is one model, and each of those models gets trained on only ~1 datapoint.

When a node is dropped, its activation is set to 0, so in backprop no gradient flows through it and its incoming weights get zero gradient for that example.
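
A minimal NumPy sketch of both points (shapes, keep probability, and the toy loss are my own choices, not from the notes): one sampled binary mask is one “ensemble member”, and the rows of the weight gradient belonging to dropped units come out as all zeros.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # keep probability (assumed)
x = rng.normal(size=(4,))                # input
W = rng.normal(size=(3, 4))              # weights into the hidden layer

h = np.maximum(0, W @ x)                 # hidden activations (ReLU)
mask = rng.random(3) < p                 # one binary mask = one ensemble member
h_drop = h * mask                        # dropped units output 0

# Toy loss: sum of the surviving activations, so dL/dh = mask.
# Dropped units pass zero gradient back to their incoming weights.
dL_dh = np.ones(3) * mask
dL_dW = np.outer(dL_dh * (h > 0), x)     # rows for dropped units are all zeros
print(dL_dW)
```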

During test time, we use all the nodes and scale the activations by the keep probability p (or use “inverted dropout” and scale by 1/p during training instead), which approximates averaging over the ensemble of masked models.
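
A sketch of the inverted-dropout convention (layer size and keep probability are placeholders): scale by 1/p while training so the test-time forward pass needs no extra scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                        # keep probability (assumed)

def forward(h, train=True):
    if train:
        mask = (rng.random(h.shape) < p) / p   # sample a mask, rescale by 1/p
        return h * mask
    return h                                   # test time: use every unit as-is

h = np.abs(rng.normal(size=(5,)))
print(forward(h, train=True))    # roughly half the units zeroed, rest scaled up
print(forward(h, train=False))   # deterministic, approximates the ensemble average
```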

Gradient Checking

See the Stanford notes.
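
A minimal gradient-check sketch on a toy loss (the function, step size, and error threshold below are assumptions, not from the notes): compare the analytic gradient against a centered finite-difference estimate and look at the relative error.

```python
import numpy as np

def f(w):                 # toy loss: L(w) = sum(w**2)
    return np.sum(w ** 2)

def analytic_grad(w):     # dL/dw = 2w
    return 2 * w

rng = np.random.default_rng(0)
w = rng.normal(size=(5,))
eps = 1e-5

num_grad = np.zeros_like(w)
for i in range(w.size):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    num_grad[i] = (f(w_plus) - f(w_minus)) / (2 * eps)   # centered difference

# Relative error; values around 1e-7 or smaller usually indicate a correct gradient.
ana = analytic_grad(w)
rel_err = np.abs(num_grad - ana) / np.maximum(np.abs(num_grad) + np.abs(ana), 1e-12)
print(rel_err.max())
```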

2. Define the input layer and first hidden layer. Add Dropout regularization, which helps prevent overfitting (see the sketch below).
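
A Keras sketch of that step; the input dimension, layer width, and dropout rate are placeholder choices, not values from the notes.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(20,)),            # input layer (20 features assumed)
    layers.Dense(64, activation="relu"),  # first hidden layer
    layers.Dropout(0.5),                  # randomly drop 50% of units at train time
])
```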