Regularization is a family of techniques for preventing overfitting, used widely in machine learning. It is related to the idea of Occam’s Razor: prefer the simpler model.

Regularization penalizes/punishes the complexity of the model. $\lambda$ is the regularization parameter, which controls the strength of the penalty, so we usually have $\lambda > 0$.

I actually had this come up in my Intact interview, where they asked me to explain Dropout and I didn’t know what it was.

We have the general formula

$$L = \frac{1}{N}\sum_{i=1}^{N} L_i\big(f(x_i, W), y_i\big) + \lambda R(W)$$

where the first term is the data loss and $\lambda R(W)$ is the regularization loss. Common choices of $R(W)$:
  • L2 Regularization: $R(W) = \sum_k \sum_l W_{k,l}^2$
  • L1 Regularization: $R(W) = \sum_k \sum_l |W_{k,l}|$
  • Elastic Net (L1 + L2): $R(W) = \sum_k \sum_l \beta W_{k,l}^2 + |W_{k,l}|$
  • Max norm regularization (constrain the norm of each weight vector instead of adding a penalty)
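The penalties above are easy to write down directly. A minimal NumPy sketch (my own code; the function names are made up for illustration):

```python
import numpy as np

def l2_penalty(W):
    # R(W) = sum of squared weights (encourages small, diffuse weights)
    return float(np.sum(W ** 2))

def l1_penalty(W):
    # R(W) = sum of absolute weights (encourages sparsity)
    return float(np.sum(np.abs(W)))

def elastic_net(W, beta=0.5):
    # elastic net: beta * L2 penalty + L1 penalty
    return beta * l2_penalty(W) + l1_penalty(W)

def regularized_loss(data_loss, W, lam=1e-3, penalty=l2_penalty):
    # full objective: L = data loss + lambda * R(W)
    return data_loss + lam * penalty(W)
```

Max norm regularization is different in kind: rather than adding a term to the loss, it clips each neuron’s weight vector back to a maximum norm after each update.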

Neural Network Specific:

Dropout

This refers to when we “dropout” nodes during training, i.e. randomly deactivate them, so their activations don’t have an effect on the rest of the network.

  • I was asked about this in my Intact interview

How could this possibly be a good idea? It forces the network to learn a redundant representation, which helps prevent overfitting. Another interpretation is that dropout trains a large ensemble of models that share parameters: each binary mask is one model, and each such model gets trained on only ~1 datapoint.

When we drop a node, its activation is 0, so in Backprop the gradient flowing through it is 0 and its incoming weights are not updated for that example.

During test time, we don’t drop any nodes; we run the full network, which approximates averaging the ensemble of masked models. With inverted dropout (scaling by $1/p$ during training), the test-time forward pass needs no change; with vanilla dropout, we multiply the activations by the keep probability $p$ at test time.
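A minimal inverted-dropout sketch in NumPy (my own code, not from any particular library; `p` and the function name are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # probability of keeping a node

def dropout_forward(x, train=True):
    if train:
        # inverted dropout: zero out ~(1 - p) of the nodes and scale
        # the survivors by 1/p, so the expected activation is unchanged
        mask = (rng.random(x.shape) < p) / p
        return x * mask
    # test time: use every node; no extra scaling needed because the
    # 1/p scaling was already applied during training
    return x
```

Each training-time call samples a fresh binary mask, i.e. one member of the ensemble.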

Gradient Checking

See the Stanford notes.
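A minimal centered-difference gradient check (my own NumPy sketch; the helper name is made up):

```python
import numpy as np

def numerical_grad(f, x, h=1e-5):
    """Centered-difference approximation of df/dx at x."""
    grad = np.zeros_like(x)
    flat_x, flat_g = x.ravel(), grad.ravel()  # views into x and grad
    for i in range(flat_x.size):
        old = flat_x[i]
        flat_x[i] = old + h
        fp = f(x)        # f(x + h * e_i)
        flat_x[i] = old - h
        fm = f(x)        # f(x - h * e_i)
        flat_x[i] = old  # restore
        flat_g[i] = (fp - fm) / (2 * h)
    return grad

# sanity check against the analytic gradient of f(w) = sum(w^2), which is 2w
w = np.array([1.0, -2.0, 3.0])
num = numerical_grad(lambda v: float(np.sum(v ** 2)), w)
```

In practice you compare the numerical and analytic gradients with a relative error; a relative error around $10^{-7}$ or below usually indicates a correct backprop implementation.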

2. Define the input layer and first hidden layer. Add Dropout regularization, which prevents overfitting.

From MATH213

Definition 2: Regularization

A function $f_r$ is the regularization of a function $f$ if

  1. $f_r(t) = f(t)$ for all $t$ in the domain of $f$.
  2. For all $t_0$ such that $\lim_{t \to t_0} f(t)$ exists but $f(t_0)$ is undefined, $f_r(t_0) = \lim_{t \to t_0} f(t)$.
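A concrete example (mine, not from the MATH213 notes): $f(t) = \sin(t)/t$ is undefined at $t = 0$, but $\lim_{t \to 0} \sin(t)/t = 1$ exists, so the regularization fills in that single value:

```python
import math

def f(t):
    # f(t) = sin(t)/t is undefined at t = 0 (0/0)
    return math.sin(t) / t

def f_reg(t):
    # the regularization: agrees with f everywhere f is defined,
    # and takes the limit value at the removable discontinuity
    return 1.0 if t == 0 else math.sin(t) / t
```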

Theorem 1

If $f_r$ is the regularization of a function $f$ that has a finite number of discontinuities, then $\mathcal{L}\{f_r\} = \mathcal{L}\{f\}$, i.e. the two functions have the same Laplace transform.

Also see Finite Zeros and Finite Poles.