# Regularization

Regularization is a technique for preventing overfitting, and is used widely in Machine Learning. It is related to the idea of Occam’s Razor: prefer the simpler model.

Regularization penalizes the complexity of the model. $λ$ is the regularization parameter (strength), so the penalty term usually appears as $λR(W)$.

I actually had this in my Intact interview, where they asked me to explain Dropout and I didn’t know what it was.

We have a general formula $L=\frac{1}{N}\sum_{i} L_{i}(f(x_{i},W),y_{i})+λR(W)$
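
As a small sketch of this formula, assuming a squared-error per-example loss and an L2 penalty on a linear model (both illustrative choices, not fixed by the formula itself):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # N = 5 examples, 3 features
y = rng.normal(size=5)
W = rng.normal(size=3)
lam = 0.1                     # regularization strength (lambda)

per_example = (X @ W - y) ** 2        # L_i(f(x_i, W), y_i), here squared error
data_loss = per_example.mean()        # (1/N) * sum_i L_i
reg_loss = lam * np.sum(W ** 2)       # lambda * R(W), here R = L2
total = data_loss + reg_loss
```

Note that $R(W)$ only touches the weights, never the data: the data loss measures fit, the regularization loss measures complexity.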

Types:

- L2 Regularization $R(W)=\sum_{k}\sum_{l} W_{k,l}^{2}$
- L1 Regularization $R(W)=\sum_{k}\sum_{l} |W_{k,l}|$
- Elastic Net (L1 + L2) $R(W)=\sum_{k}\sum_{l} βW_{k,l}^{2}+|W_{k,l}|$
- Max norm regularization
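
A quick numerical sketch of the first three penalties on a toy weight matrix (the $β$ value here is arbitrary):

```python
import numpy as np

W = np.array([[1.0, -2.0],
              [0.5,  3.0]])
beta = 0.5

l2 = np.sum(W ** 2)        # L2: sum of squared weights -> 14.25
l1 = np.sum(np.abs(W))     # L1: sum of absolute weights -> 6.5
elastic = beta * l2 + l1   # Elastic net: weighted combination of both
```

L2 shrinks all weights smoothly, while L1 pushes small weights to exactly zero (sparsity); elastic net trades off between the two via $β$.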

Neural Network Specific:

- Dropout
- Batch Normalization
- Stochastic depth

### Dropout

This refers to when we “dropout” nodes, i.e. deactivate them, so their weights don’t have an effect on the rest of the network.

- I was asked about this in my Intact interview

How could this possibly be a good idea? It forces the network to learn a redundant representation, which helps prevent overfitting. Another interpretation is that dropout trains a large ensemble of models that share parameters: each binary mask is one model, and each mask gets trained on only ~1 datapoint.

When we drop a node out, its activation is 0, so in Backprop the gradient through it is also 0.

During test time, we use all nodes and scale the activations by the keep probability $p$, which approximates averaging the predictions of the ensemble of masked networks. (Inverted dropout instead scales by $1/p$ at training time, so test time needs no change.)
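
A minimal sketch of inverted dropout, assuming NumPy; the $1/p$ scaling at training time keeps the expected activation unchanged, so the test-time pass is just the identity:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p_keep, train=True):
    """Inverted dropout: drop units with prob 1 - p_keep, scale survivors by 1/p_keep."""
    if not train:
        return x, None  # test time: use all units, no extra scaling needed
    mask = (rng.random(x.shape) < p_keep) / p_keep
    return x * mask, mask

def dropout_backward(dout, mask):
    # Dropped units have mask == 0, so their gradient is 0 in backprop
    return dout * mask

x = np.ones(100_000)
out, mask = dropout_forward(x, p_keep=0.8)
# E[out] stays close to 1.0 because of the 1/p_keep scaling
```

Storing the mask from the forward pass is what makes the backward pass consistent: the same units that were silenced forward contribute zero gradient backward.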

Gradient Checking → see the Stanford CS231n notes

**2. Define the input layer and first hidden layer. Add Dropout regularization, which prevents overfitting.**

### From MATH213

Definition 2: Regularization

A function $f$ is the regularization of a function $g$ if

- For all $x$ in the domain of $g$, $f(x)=g(x)$.
- For all $x$ such that $\lim_{t→x}g(t)$ exists but $g(x)$ is undefined, $f(x)=\lim_{t→x}g(t)$.
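
A concrete instance: $g(x)=\sin(x)/x$ is undefined at $x=0$, but $\lim_{t→0}\sin(t)/t=1$, so the regularization $f$ fills in that single hole (a sketch, with the example function chosen here for illustration):

```python
import math

def g(x):
    # g(x) = sin(x)/x is undefined at x = 0 (division by zero)
    return math.sin(x) / x

def f(x):
    # f is the regularization of g: agree with g everywhere g is defined,
    # and fill the removable hole at 0 with the limit value 1
    if x == 0:
        return 1.0
    return g(x)
```

Away from $x=0$, $f$ and $g$ are identical; the regularization only repairs the removable discontinuity.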

Theorem 1

If $f(x)$ is the regularization of a function $g$ that has a finite number of discontinuities, then $\int f(x)\,dx=\int g(x)\,dx$.

Also see Finite Zeros and Finite Poles.