Gradient Descent
Gradient descent is an optimization algorithm commonly used to train machine learning models and neural networks.
You should be familiar with Partial Derivatives first.
Let $f(\mathbf{w})$ be a differentiable function of the parameter/weight vector $\mathbf{w}$. Define the gradient of $f$ to be
$$\nabla f(\mathbf{w}) = \left(\frac{\partial f}{\partial w_1}, \dots, \frac{\partial f}{\partial w_n}\right).$$
To find a local minimum of $f$, adjust $\mathbf{w}$ in the direction of the negative gradient:
$$\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla f(\mathbf{w}),$$
where $\eta$ is the learning rate (also called the step size).
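A minimal sketch of this update rule on a toy quadratic (the objective, learning rate, and iteration count are placeholder choices for illustration, not from this note):

```python
import numpy as np

def f(w):
    """Toy objective: minimized at w = (3, -1)."""
    return (w[0] - 3) ** 2 + (w[1] + 1) ** 2

def grad_f(w):
    """Analytical gradient of the toy objective."""
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

w = np.zeros(2)   # initial weights
eta = 0.1         # learning rate / step size
for step in range(100):
    w = w - eta * grad_f(w)   # step against the gradient

print(f(w), w)    # loss ~ 0, w ~ [3, -1]
```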
There are two ways to compute the gradient:
- Numerical Gradient: approximate and slow (one finite-difference evaluation per parameter), so not practically feasible for training; mainly useful as a sanity check. See page for code (a minimal sketch is also below).
- Analytical Gradient: exact and fast; derived with calculus.
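As a stand-in for the linked code, here is a hedged finite-difference sketch (centered differences with h = 1e-5 is a standard choice, not something specified in this note), checked against the analytical gradient of the same toy objective:

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Centered differences: df/dw_i ≈ (f(w + h*e_i) - f(w - h*e_i)) / (2h)."""
    grad = np.zeros_like(w)
    for i in range(w.size):              # one pair of evaluations per parameter -> slow
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += h
        w_minus[i] -= h
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * h)
    return grad

f = lambda w: (w[0] - 3) ** 2 + (w[1] + 1) ** 2            # same toy objective
analytic = lambda w: np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

w = np.array([0.5, -2.0])
print(numerical_gradient(f, w))   # ≈ analytic(w)
print(analytic(w))
```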
Now, Backpropagation comes into the picture when we look at how to compute the evaluate_gradient function. For the SVM loss, for example, we derived a fixed analytical gradient by hand, but we want to be able to compute the gradient of any function.
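The note doesn't name an autodiff library; as one illustration of "the gradient of any function", PyTorch's autograd runs backpropagation over an arbitrary expression (torch is my choice here, not the note's):

```python
import torch

# An arbitrary differentiable function of the weights; no hand-derived gradient needed.
w = torch.tensor([0.5, -2.0], requires_grad=True)
loss = torch.sin(w[0]) * torch.exp(w[1]) + (w[0] * w[1]) ** 2

loss.backward()   # backpropagation walks the recorded computation graph
print(w.grad)     # dloss/dw, computed automatically
```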
Challenges
- Running into local minima / saddle points → mitigate these by adding stochasticity (see the mini-batch sketch below).
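A sketch of where the stochasticity comes from: estimating the gradient on a random mini-batch rather than the full dataset, so each step is a noisy gradient estimate (the linear-regression loss, batch size, and learning rate are placeholder choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                   # toy dataset
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
eta, batch_size = 0.05, 32
for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)               # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size                 # noisy gradient estimate
    w -= eta * grad                                              # the noise perturbs w off saddle points

print(w)   # close to the true weights [1, -2, 0.5, 3, 0]
```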
Related
WAIT, serendipity: we saw this sort of idea in E&M with Electric Potential Energy, remember the lab. We saw that the gradient captures how the potential changes in space; in fact the electric field is the negative gradient of the potential (wait, how do Electric Field, Electric Potential, and Electric Potential Energy relate?).
Optimizer
- SGD: the vanilla update, stepping straight down the (mini-batch) gradient.
- Momentum update: keep a velocity vector that accumulates gradients, smoothing and accelerating descent (sketch below).
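A hedged sketch of the two updates as they are commonly written (the momentum coefficient 0.9 and the learning rates are common defaults, not values from this note):

```python
import numpy as np

def sgd_step(w, grad, lr=1e-2):
    """Vanilla SGD: step straight down the (mini-batch) gradient."""
    return w - lr * grad

def momentum_step(w, v, grad, lr=1e-2, mu=0.9):
    """Momentum update: the velocity integrates past gradients."""
    v = mu * v - lr * grad   # accumulate velocity
    w = w + v                # move along the velocity, not the raw gradient
    return w, v

# usage on the toy quadratic from above
grad_f = lambda w: np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])
w, v = np.zeros(2), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, grad_f(w))
print(w)   # ≈ [3, -1]
```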
As you scale up your network, it looks more like a bowl. Local minima become less and less of an issue; they only really happen with small networks. - Andrej Karpathy
Adam is the default choice. See original paper: https://arxiv.org/abs/1412.6980v8
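A sketch of the Adam update following the linked paper (β1 = 0.9, β2 = 0.999, ε = 1e-8 are the paper's suggested defaults; the toy usage reuses the quadratic from earlier):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2014). t is the 1-indexed step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

grad_f = lambda w: np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])
w = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 5001):                        # Adam moves ~lr per step, so this toy needs many steps
    w, m, v = adam_step(w, grad_f(w), m, v, t)
print(w)   # ≈ [3, -1]
```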
Use exponential decay for the learning rate (sketch below).
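One common form of exponential decay (the initial rate, decay factor, and decay interval below are placeholder values):

```python
def exponential_decay(step, lr0=1e-2, decay_rate=0.95, decay_steps=1000):
    """eta_t = eta_0 * decay_rate**(t / decay_steps): multiply the LR by decay_rate every decay_steps."""
    return lr0 * decay_rate ** (step / decay_steps)

print(exponential_decay(0), exponential_decay(10_000))   # 0.01 -> ~0.006
```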