# Gradient Descent

Gradient descent is an optimization algorithm commonly used to train machine learning models and neural networks.

You should be familiar with Partial Derivatives first.

Let $J(w)$ be a differentiable function of parameter/weights vector $w$. Define the gradient of $J(w)$ to be

$$\nabla_{w} J(w) = \begin{pmatrix} \dfrac{\partial J(w)}{\partial w_{1}} \\ \vdots \\ \dfrac{\partial J(w)}{\partial w_{n}} \end{pmatrix}$$

To find a local minimum of $J(w)$, adjust $w$ in the direction of the negative gradient: $\Delta w = -\frac{1}{2}\alpha \nabla_{w} J(w)$, where $\alpha$ is the learning rate (step-size parameter).


There are two ways to compute the gradient:

- Numerical Gradient: approximate and slow (it needs extra loss evaluations for every parameter), so it's not practical for training, but it's handy as a check on the analytic gradient. See page for code, and the sketch after this list.
- Analytical Gradient: exact and fast, derived with calculus, but error-prone to work out by hand.
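
As a minimal sketch of the numerical route (the function name, NumPy usage, and centered-difference scheme are my choices; the page with the original code isn't reproduced here):

```python
import numpy as np

def numerical_gradient(J, w, h=1e-5):
    """Approximate the gradient of J at w with centered differences.

    Needs 2 evaluations of J per parameter (2n total), which is why it's
    too slow for training and is used mainly to check analytic gradients.
    """
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + h          # J(w + h*e_i)
        J_plus = J(w)
        w.flat[i] = old - h          # J(w - h*e_i)
        J_minus = J(w)
        w.flat[i] = old              # restore
        grad.flat[i] = (J_plus - J_minus) / (2 * h)
    return grad

# e.g. numerical_gradient(lambda w: np.sum(w**2), np.array([1.0, -2.0]))
# → approximately [2.0, -4.0]
```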

Now, Backpropagation comes into the picture when we look at how to compute the `evaluate_gradient` function. For the SVM loss, for example, we have a fixed analytical gradient that we calculated by hand, but we want to be able to do this for any function.
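
Here is a runnable sketch of the vanilla descent loop around `evaluate_gradient`; the quadratic `loss_fun`, its hand-derived gradient, and the data are toy stand-ins of my own, and the factor of $\frac{1}{2}$ in the update rule above can simply be absorbed into the step size:

```python
import numpy as np

def loss_fun(data, weights):
    # Toy quadratic loss standing in for J(w); an SVM loss would slot in here.
    return np.sum((data @ weights) ** 2)

def evaluate_gradient(loss_fun, data, weights):
    # Fixed analytic gradient of the toy loss: ∇_w sum((Xw)^2) = 2 Xᵀ(Xw).
    return 2 * (data.T @ (data @ weights))

data = np.random.randn(10, 3)
weights = np.random.randn(3)
step_size = 1e-2

for _ in range(100):                       # vanilla gradient descent loop
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad   # step against the gradient
```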

### Challenges

- Getting stuck in a local minimum or at a saddle point → mitigate by adding stochasticity (see SGD below).

### Related

WAIT, serendipity: we saw this sort of idea in E&M with Electric Potential Energy, remember the lab. We saw that the gradient represents the change in potential, just as the electric field points down the potential's gradient, $\mathbf{E} = -\nabla V$. (Wait, how do Electric Field, Electric Potential, and Electric Potential Energy relate?)

### Optimizer

SGD: estimate the gradient on a small minibatch of training data rather than the full dataset. Each step is cheap, and the minibatch noise also supplies the stochasticity mentioned under Challenges.
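
A sketch continuing the names from the toy loop above; `sample_training_data` is a hypothetical helper and the batch size of 256 is illustrative:

```python
def sample_training_data(data, batch_size):
    # Draw a random minibatch of rows from the full dataset.
    idx = np.random.choice(len(data), batch_size, replace=False)
    return data[idx]

for _ in range(1000):
    data_batch = sample_training_data(data, 256)   # small minibatch
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad
```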

Momentum update: keep a running velocity that accumulates a decaying average of past gradients, and step along the velocity instead of the raw gradient; this damps oscillations and builds speed along consistent directions.
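
A sketch of the momentum update, continuing the names above; the coefficient `mu` (commonly around 0.9) and the variable names are my choices:

```python
v = np.zeros_like(weights)        # velocity state, persists across steps
mu = 0.9                          # momentum / friction coefficient

for _ in range(1000):
    dw = evaluate_gradient(loss_fun, data, weights)
    v = mu * v - step_size * dw   # accumulate a decaying gradient average
    weights += v                  # step along the velocity, not the raw gradient
```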

As you scale up your network, the loss surface looks more like a bowl. Local minima become less and less of an issue. It only really happens with small networks. - Andrej Karpathy

**Adam** is the default choice. See the original paper: https://arxiv.org/abs/1412.6980v8
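
A sketch of the Adam update following the paper above, with its usual default hyperparameters; the state `m`, `v`, and the timestep `t` persist across iterations, and the surrounding names continue the toy setup:

```python
m = np.zeros_like(weights)                # first-moment estimate (mean of gradients)
v = np.zeros_like(weights)                # second-moment estimate (mean of squared gradients)
beta1, beta2, eps = 0.9, 0.999, 1e-8      # defaults from the paper

for t in range(1, 1001):
    dw = evaluate_gradient(loss_fun, data, weights)
    m = beta1 * m + (1 - beta1) * dw      # decaying average of gradients
    v = beta2 * v + (1 - beta2) * dw**2   # decaying average of squared gradients
    m_hat = m / (1 - beta1**t)            # bias-correct the zero initialization
    v_hat = v / (1 - beta2**t)
    weights -= step_size * m_hat / (np.sqrt(v_hat) + eps)
```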

Use exponential decay for the learning rate: $\alpha = \alpha_{0} e^{-kt}$, where $\alpha_{0}$ is the initial rate, $k$ is a decay constant, and $t$ is the iteration (or epoch) index.
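
As a one-line sketch (the names `alpha0`, `k`, and `t` are my own, matching the symbols above):

```python
alpha = alpha0 * np.exp(-k * t)   # e.g. alpha0 = 1e-2, k = 0.05, t = epoch index
```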