Dropout
Dropout is a regularization technique that randomly zeroes activations during training so no single neuron can be relied on.
Why would randomly breaking the network help?
It forces redundant representations and prevents co-adaptation (neuron A can’t rely on neuron B always being on). Another view: dropout trains a huge ensemble of weight-sharing subnetworks, one per sampled mask.
- When an activation is dropped, its output is 0 and so is its gradient in backprop, so no update flows through it to its incoming weights (see the sketch after this list)
- I was asked about this in my Intact interview
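A minimal sketch of that forward/backward behaviour, assuming a train-time layer that caches its mask and reuses it in backprop (the function names are illustrative, not from any library):

```python
import numpy as np

def dropout_forward(h, p, rng):
    # Drop each activation with probability p; cache the mask for the backward pass
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask, mask

def dropout_backward(dout, mask):
    # Dropped units pass no gradient: the same mask zeroes dout,
    # so their incoming weights get no update from this step
    return dout * mask
```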
Train / Test
From the CS231n Lec 6 slides:
- Train: at each forward pass, sample a random binary mask and zero out activations with probability $p$; a common setting is $p = 0.5$
- Test: don’t drop, instead scale all activations by $1 - p$ so the expected output matches training (see the sketch below)
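A minimal numpy sketch of the two phases, assuming a layer's activations `h` and the drop-probability convention above (names are illustrative):

```python
import numpy as np

p = 0.5  # probability of dropping an activation

def dropout_train(h, p, rng=np.random.default_rng()):
    # Sample a binary mask: keep with probability 1 - p, drop with probability p
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask

def dropout_test(h, p):
    # No dropping at test time; scale so the expected magnitude matches training
    return h * (1 - p)
```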
This is the general regularization recipe: at train time $y = f_W(x, z)$ (random $z$), and at test we want
$$y = \mathbb{E}_z\big[f_W(x, z)\big] = \int p(z)\, f_W(x, z)\, dz$$
where:
- $z$ is the random binary mask
- $p(z)$ is its distribution
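As a one-unit worked instance of that expectation (assuming drop probability $p$, a pre-dropout activation $x$, and mask $z \sim \mathrm{Bernoulli}(1-p)$):

$$\mathbb{E}_z[z\,x] = (1-p)\cdot x + p\cdot 0 = (1-p)\,x$$

For a single linear unit the test-time scaling reproduces this expectation exactly; for a full nonlinear network it is only an approximation.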
Intuition
At train time, each neuron fires with probability $1 - p$, so its downstream expected contribution is $(1 - p)\,x$. At test time nothing drops, so activations are larger on average by a factor of $\frac{1}{1 - p}$. Scaling by $1 - p$ at test time (or equivalently scaling by $\frac{1}{1 - p}$ at train time, “inverted dropout”) lines the two up so the network sees the same expected input distribution in both phases.
Dropout’s “scale by $1 - p$” is the cheap deterministic approximation to this integral.
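A minimal inverted-dropout sketch along these lines, again assuming drop probability `p` (names are illustrative):

```python
import numpy as np

p = 0.5  # drop probability

def inverted_dropout_train(h, p, rng=np.random.default_rng()):
    # Keep with probability 1 - p, then divide by 1 - p so the expected
    # activation already matches the test-time forward pass
    mask = (rng.random(h.shape) >= p).astype(h.dtype) / (1 - p)
    return h * mask

def inverted_dropout_test(h):
    # Test time is a plain forward pass: no mask, no scaling
    return h
```

Pushing the scaling into training is what most modern implementations do, since it leaves the test-time path untouched.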
Ensemble Interpretation
A net with $n$ activations has $2^n$ possible binary masks. For a 4096-wide layer that’s $2^{4096} \approx 10^{1233}$ subnetworks sharing weights. Each forward pass trains one of them on one minibatch; test-time scaling averages the ensemble.
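A quick numpy check of this averaging claim, under the simplifying assumption of a single linear readout `w` on top of a 4096-wide activation `x`:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
x = rng.standard_normal(4096)  # one layer's activations
w = rng.standard_normal(4096)  # downstream linear readout

# Monte Carlo average over many sampled subnetworks (one mask = one subnetwork)
outs = [w @ (x * (rng.random(4096) >= p)) for _ in range(10_000)]
mc_average = np.mean(outs)

# Deterministic test-time approximation: scale activations by 1 - p
scaled = w @ (x * (1 - p))

print(mc_average, scaled)  # should agree closely
```

For a linear readout the scaled output equals the ensemble mean exactly in expectation; with nonlinearities in between, the scaling is only an approximation of the true ensemble average.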
Intuition
You can’t rely on any specific teammate showing up, so every neuron has to be individually useful. No “this feature works only if that other feature co-activates” hacks, because at any given step the other feature might be dropped. It’s ensemble learning disguised as a regularizer: you train an exponential number of subnetworks with shared weights, and at test time you implicitly average them by scaling.