Regularization

Regularization is any technique that discourages model complexity in order to prevent overfitting; it is an instance of Occam's Razor and is used widely in machine learning.

The general form adds a penalty to the loss:

$$L = \frac{1}{N} \sum_i L_i\big(f(x_i, W), y_i\big) + \lambda R(W)$$

where:

  • $\lambda$ is the regularization strength (a hyperparameter)
  • $R(W)$ penalizes model complexity (e.g. a weight norm)
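A minimal sketch of this form, using mean squared error as the data term and the squared L2 norm as $R(W)$ (names and loss choice are illustrative, not fixed by the formula):

```python
import numpy as np

def regularized_loss(W, X, y, lam):
    data_loss = np.mean((X @ W - y) ** 2)  # fit the training set
    penalty = np.sum(W ** 2)               # R(W): discourage large weights
    return data_loss + lam * penalty
```

With `lam = 0` this reduces to the plain data loss; increasing `lam` trades training fit for smaller weights.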

Intuition

The data term wants the model to fit the training set. The penalty term wants the model to stay simple. $\lambda$ is the exchange rate between the two. A flexible model usually has many weight settings that fit the training data equally well; the penalty picks the simplest one from that bunch, which is the one most likely to generalize. Equivalently (Bayesian view), $R(W)$ is a negative log prior over weights, and minimizing loss plus penalty is MAP estimation.
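The MAP correspondence spelled out (a standard derivation, not from the note): a Gaussian prior on the weights turns into an L2 penalty after taking logs.

```latex
\hat{W}_{\text{MAP}}
  = \arg\max_W \; \log p(\mathcal{D} \mid W) + \log p(W),
\qquad
p(W) \propto e^{-\lambda \lVert W \rVert_2^2}
\;\Rightarrow\;
\hat{W}_{\text{MAP}}
  = \arg\min_W \; \underbrace{-\log p(\mathcal{D} \mid W)}_{\text{data loss}}
    + \lambda \lVert W \rVert_2^2
```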

Parameter penalties (reduce effective model capacity):

  • L2 Regularization: $R(W) = \sum_k \sum_l W_{k,l}^2$
  • L1 Regularization (Lasso): $R(W) = \sum_k \sum_l |W_{k,l}|$
  • Elastic Net (L1 + L2): $R(W) = \sum_k \sum_l \beta W_{k,l}^2 + |W_{k,l}|$
  • Max-norm regularization
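The three penalty formulas above, as code (a sketch; `beta` is the Elastic Net mixing weight from the formula):

```python
import numpy as np

def l2_penalty(W):
    return np.sum(W ** 2)                      # sum of squared weights

def l1_penalty(W):
    return np.sum(np.abs(W))                   # sum of absolute weights

def elastic_net_penalty(W, beta):
    return np.sum(beta * W ** 2 + np.abs(W))   # beta trades off L2 vs L1
```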

Training-time regularization:

  • Dropout
  • Data augmentation
  • Early stopping

Model/structure choices (capacity control):

  • Simpler model / fewer parameters
  • Feature selection / dimensionality reduction (PCA)
  • Ensembling (bagging, random forest): reduces variance

Linear models:

  • SVM margin (hinge loss + $\lambda \lVert w \rVert_2^2$) acts like regularization via the margin/penalty tradeoff
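A sketch of that soft-margin objective: average hinge loss plus an L2 penalty, where `lam` trades margin violations against the weight norm (function name and shapes are illustrative):

```python
import numpy as np

def svm_objective(w, X, y, lam):
    margins = y * (X @ w)                            # labels y in {-1, +1}
    hinge = np.mean(np.maximum(0.0, 1.0 - margins))  # penalize margins < 1
    return hinge + lam * np.dot(w, w)                # + lambda * ||w||^2
```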

Normalization (not strictly regularization but stabilizes training / helps prevent overfitting):

  • Batch normalization
  • Layer normalization

L1 vs L2 preference

From the CS231n Lecture 3 slides. Given $x = [1, 1, 1, 1]$, two weight vectors with the same dot product $w^\top x = 1$:

  • $w_1 = [1, 0, 0, 0]$: L1 picks this ("sparse")
  • $w_2 = [0.25, 0.25, 0.25, 0.25]$: L2 picks this ("spread out")

Both produce identical predictions on this $x$; the regularizer is what breaks the tie. Pick the penalty that matches your prior over what a "good" classifier looks like.
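The tie-break can be checked numerically. One caveat: on this exact pair the L1 norms happen to tie (both equal 1), so L1's preference for sparsity comes from its geometry during optimization rather than from this comparison; L2, by contrast, strictly prefers the spread-out vector:

```python
import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])        # sparse
w2 = np.array([0.25, 0.25, 0.25, 0.25])    # spread out

assert w1 @ x == w2 @ x == 1.0                    # identical predictions on x
assert np.sum(w2 ** 2) < np.sum(w1 ** 2)          # L2 penalty prefers w2
assert np.sum(np.abs(w1)) == np.sum(np.abs(w2))   # L1 norms tie here
```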

Why regularize?

  1. Express preferences over weights (which solution among the equivalent ones?)
  2. Make the model simple so it generalizes to test data
  3. Improve optimization by adding curvature (the loss landscape becomes more bowl-shaped)
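Point 3 shows up concretely in closed-form ridge regression (a sketch, not from the note): the $\lambda I$ term adds curvature in every direction, so $X^\top X + \lambda I$ is invertible even when $X^\top X$ is singular.

```python
import numpy as np

def ridge_solution(X, y, lam):
    d = X.shape[1]
    # lam * I makes the normal equations well-conditioned, even when
    # X has duplicate (collinear) columns and X^T X is singular.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

With `lam = 0` and collinear columns, `np.linalg.solve` would fail on the singular matrix; any `lam > 0` restores a unique solution.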

From MATH213

Definition 2: Regularization

A function $\tilde{f}$ is the regularization of a function $f$ if

  1. For all $t$ in the domain of $f$, $\tilde{f}(t) = f(t)$
  2. For all $t$ such that $\lim_{s \to t} f(s)$ exists but $f(t)$ is undefined, $\tilde{f}(t) = \lim_{s \to t} f(s)$
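For example (my illustration, not from the course notes): $f(t) = \sin(t)/t$ is undefined at $t = 0$, but the limit there exists and equals 1, so the regularization $\tilde{f}$ fills the hole:

```latex
\tilde{f}(t) =
\begin{cases}
\dfrac{\sin t}{t} & t \neq 0 \\[6pt]
1 & t = 0
\end{cases}
```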

Theorem 1

If $\tilde{f}$ is the regularization of a function $f$ that has a finite number of discontinuities, then $\mathcal{L}\{\tilde{f}\} = \mathcal{L}\{f\}$.

Also see Finite Zeros and Finite Poles.