Regularization
Regularization helps prevent overfitting and is used widely in machine learning. It is related to the idea of Occam’s Razor.
Regularization penalizes the complexity of the model. $\lambda$ is the regularization parameter (the regularization strength), so we usually have $\lambda > 0$.
I actually had this in my Intact interview, where they had me explain Dropout and I didn’t know what it was.
We have a general formula for the regularized loss:
$$L(W) = \frac{1}{N} \sum_{i=1}^{N} L_i\big(f(x_i; W), y_i\big) + \lambda R(W)$$
where $R(W)$ is the regularization penalty.
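A minimal sketch of that formula in code (my own, not from the notes), using an L2 penalty for $R(W)$; the shapes and $\lambda = 10^{-3}$ are just illustrative:

```python
import numpy as np

def regularized_loss(data_loss: float, W: np.ndarray, lam: float = 1e-3) -> float:
    """Regularized loss: data loss plus lambda times an L2 penalty on the weights."""
    penalty = np.sum(W ** 2)            # R(W) = sum of squared weights (L2)
    return data_loss + lam * penalty    # lam = regularization strength

# Example: a dummy data loss and a random weight matrix.
W = np.random.randn(10, 5)
print(regularized_loss(data_loss=0.73, W=W))
```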
Types
- L2 Regularization
- L1 Regularization
- Elastic Net (L1 + L2)
- Max norm regularization (the penalty/constraint terms for these four are sketched after this list)
Neural Network Specific:
- Dropout
- Batch Normalization
- Stochastic depth
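As a quick reference (my own summary; $\lambda_1$, $\lambda_2$, and the max-norm radius $c$ are the usual hyperparameters, not symbols from these notes), the penalty or constraint that each of the first four adds is:
$$R_{\text{L2}}(W) = \sum_{k,l} W_{k,l}^2 \qquad R_{\text{L1}}(W) = \sum_{k,l} \lvert W_{k,l} \rvert$$
$$R_{\text{elastic}}(W) = \sum_{k,l} \big( \lambda_1 \lvert W_{k,l} \rvert + \lambda_2 W_{k,l}^2 \big) \qquad \text{max norm: enforce } \lVert w \rVert_2 \le c \text{ for each neuron's weight vector } w$$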
Dropout
This refers to when we “drop out” nodes, i.e. randomly deactivate them during training, so their activations don’t have an effect on the rest of the network.
- I was asked about this in my Intact interview
How could this possibly be a good idea? It forces the network to learn a redundant representation, which helps prevent overfitting. Another interpretation is that dropout trains a large ensemble of models (that share parameters): each binary mask is one model, and each gets trained on only ~1 datapoint.
When a node is dropped, its activation is 0, so in backprop the gradient flowing through it is 0 and its weights are not updated on that pass.
During test time, we use all nodes and scale the activations by the keep probability $p$ (an approximation to averaging the ensemble); with inverted dropout the scaling is done at training time instead, so nothing changes at test time.
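A minimal sketch of inverted dropout (my own, in the style of the Stanford CS231n notes; the keep probability p = 0.5 and the layer sizes are arbitrary):

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def forward_train(x, W1, W2):
    h = np.maximum(0, x @ W1)                   # ReLU hidden layer
    mask = (np.random.rand(*h.shape) < p) / p   # drop units, scale survivors by 1/p
    h *= mask                                   # dropped activations are 0, so their gradients are 0 too
    return h @ W2

def forward_test(x, W1, W2):
    h = np.maximum(0, x @ W1)                   # use all units; no scaling needed (inverted dropout)
    return h @ W2

# Example usage with random data and weights.
x = np.random.randn(4, 8)
W1, W2 = np.random.randn(8, 16), np.random.randn(16, 3)
print(forward_train(x, W1, W2).shape, forward_test(x, W1, W2).shape)
```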
Gradient Checking → See Stanford notes
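A quick sketch of the idea (mine, not from the Stanford notes): compare an analytic gradient against a centered-difference numerical estimate.

```python
import numpy as np

def numerical_grad(f, x, h=1e-5):
    """Centered-difference estimate of the gradient of f at x (x is modified in place and restored)."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"], op_flags=["readwrite"])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + h
        fp = f(x)            # f(x + h * e_i)
        x[i] = old - h
        fm = f(x)            # f(x - h * e_i)
        x[i] = old           # restore the original value
        grad[i] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

# Example: the analytic gradient of f(x) = sum(x**2) is 2x.
x = np.random.randn(3, 2)
numeric = numerical_grad(lambda z: np.sum(z ** 2), x)
print(np.max(np.abs(numeric - 2 * x)))  # should be tiny (~1e-10)
```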
Typical usage: define the input layer and first hidden layer, then add Dropout regularization to help prevent overfitting (see the sketch below).
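A sketch of that step, assuming Keras (the 20 input features, 64 hidden units, and 0.5 dropout rate are made-up values):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(20,)),              # input layer: 20 features (assumed)
    layers.Dense(64, activation="relu"),   # first hidden layer
    layers.Dropout(0.5),                   # randomly drop 50% of activations during training
])
model.summary()
```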
From MATH213
Definition 2: Regularization
A function $\tilde{f}$ is the regularization of a function $f$ if
- $\tilde{f}(t) = f(t)$ for all $t$ in the domain of $f$.
- For all $t$ such that $\lim_{\tau \to t} f(\tau)$ exists but $f(t)$ is undefined, $\tilde{f}(t) = \lim_{\tau \to t} f(\tau)$.
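A standard example (mine, not from the course notes): $f(t) = \frac{\sin t}{t}$ is undefined at $t = 0$, but $\lim_{t \to 0} \frac{\sin t}{t} = 1$, so its regularization is
$$\tilde{f}(t) = \begin{cases} \frac{\sin t}{t} & t \neq 0 \\ 1 & t = 0. \end{cases}$$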
Theorem 1
If $\tilde{f}$ is the regularization of a function $f$ that has a finite number of discontinuities, then $\tilde{f}$ agrees with $f$ everywhere except at finitely many points, so in particular $\mathcal{L}\{\tilde{f}\} = \mathcal{L}\{f\}$.
Also see Finite Zeros and Finite Poles.