Gradient Descent

Gradient descent is an optimization algorithm commonly used to train machine learning models and neural networks.

You should be familiar with Partial Derivatives first.

Let $f(\mathbf{w})$ be a differentiable function of the parameter/weights vector $\mathbf{w}$. Define the gradient of $f$ to be

$\nabla f(\mathbf{w}) = \left( \frac{\partial f}{\partial w_1}, \ldots, \frac{\partial f}{\partial w_n} \right)$

To find a local minimum of $f$, repeatedly adjust $\mathbf{w}$ in the direction of the negative gradient:

$\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla f(\mathbf{w})$

where $\eta$ is the learning rate (also called the step size).

while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  weights -= step_size * weights_grad  # perform parameter update
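
The loop above can be made concrete. A minimal self-contained sketch, minimizing the toy function f(w) = (w − 3)² with a one-dimensional `evaluate_gradient` (the real one would take the loss function and data as arguments):

```python
# Minimize f(w) = (w - 3)^2, whose gradient is f'(w) = 2(w - 3).
def evaluate_gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0          # initial weight
step_size = 0.1  # learning rate

for _ in range(100):
    grad = evaluate_gradient(w)
    w -= step_size * grad  # parameter update

print(w)  # converges toward the minimizer w = 3
```

Each step contracts the distance to the minimizer by a constant factor (here 0.8), which is why a fixed step size converges on this quadratic.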

There are two ways to compute the gradient: a numerical gradient (slow and approximate, but trivial to write with finite differences) and an analytic gradient (fast and exact, but error-prone to derive by hand).

Now, Backpropagation comes into the picture when we look at how to implement the evaluate_gradient function. For the SVM loss, for example, we derived a fixed analytic gradient by hand, but we want to be able to compute gradients for arbitrary functions.
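
A standard use of the numerical gradient is as a sanity check on a hand-derived analytic one. A minimal sketch using a centered finite difference (the toy loss and all names here are illustrative):

```python
import numpy as np

def loss(w):
    return np.sum(w ** 2)  # toy loss: f(w) = sum of w_i^2

def analytic_grad(w):
    return 2 * w           # derived by hand: df/dw_i = 2 * w_i

def numerical_grad(f, w, h=1e-5):
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += h
        w_minus[i] -= h
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * h)  # centered difference
    return grad

w = np.array([1.0, -2.0, 0.5])
print(np.allclose(analytic_grad(w), numerical_grad(loss, w)))  # True when they agree
```

The centered difference has O(h²) error, so the two gradients should agree to many decimal places when the analytic derivation is correct.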

Challenges

  1. Running into a local minimum or a saddle point. Adding stochasticity (e.g., computing gradients on random minibatches) helps escape both.
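
The stochasticity typically comes from evaluating the gradient on a random minibatch rather than the full dataset. A minimal sketch for linear regression with mean squared error (dataset, sizes, and names are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # toy dataset
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)   # targets with a little noise

w = np.zeros(3)
step_size = 0.05
for _ in range(1000):
    idx = rng.choice(len(X), size=32, replace=False)  # random minibatch
    Xb, yb = X[idx], y[idx]
    grad = (2 / len(idx)) * Xb.T @ (Xb @ w - yb)      # gradient of the minibatch MSE
    w -= step_size * grad

print(w)  # close to true_w
```

The minibatch gradient is an unbiased but noisy estimate of the full gradient; that noise is exactly what can bump the iterate out of saddle points.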

WAIT, Serendipity, we saw this sort of idea in E&M with Electric Potential Energy, remember the lab. We saw that the gradient represents the change in potential (wait how do Electric Field, Electric Potential, and Electric Potential Energy relate?)

Optimizer

Common update rules, from vanilla SGD to momentum-based variants:

# Vanilla gradient descent
x += -learning_rate * dx

# Momentum update
# - We build up velocity along shallow directions.
# - Velocity becomes damped in steep directions due to the quickly changing sign.
# - You can initialize v = 0.
v = mu * v - learning_rate * dx  # mu acts like friction; it's a hyperparameter, typically 0.5 or 0.9
x += v

# Nesterov momentum update (a trick to get a look-ahead)
v_prev = v
v = mu * v - learning_rate * dx
x += -mu * v_prev + (1 + mu) * v

# Adagrad
cache += dx ** 2
x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)

# RMSProp (an improvement on Adagrad: the cache leaks, so learning doesn't stall)
cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)

# Adam (combines momentum and RMSProp)
m = beta1 * m + (1 - beta1) * dx        # update first moment
v = beta2 * v + (1 - beta2) * dx ** 2   # update second moment
mb = m / (1 - beta1 ** t)               # bias-corrected (only matters in the first few iterations)
vb = v / (1 - beta2 ** t)               # ""
x += -learning_rate * mb / (np.sqrt(vb) + 1e-7)
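
The Adam fragment above can be stitched into a full loop. A runnable sketch minimizing a toy quadratic, using the paper's default beta values (the target vector and learning rate are illustrative):

```python
import numpy as np

target = np.array([1.0, -3.0])

def grad(x):
    return 2 * (x - target)  # gradient of ||x - target||^2

x = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
beta1, beta2, learning_rate, eps = 0.9, 0.999, 0.01, 1e-7

for t in range(1, 5001):  # t starts at 1 so the bias correction is well-defined
    dx = grad(x)
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx ** 2
    mb = m / (1 - beta1 ** t)  # bias-corrected first moment
    vb = v / (1 - beta2 ** t)  # bias-corrected second moment
    x -= learning_rate * mb / (np.sqrt(vb) + eps)

print(x)  # close to the target
```

Note the bias correction is computed into fresh variables (mb, vb) rather than overwriting m and v in place, since the raw moments must carry over to the next iteration.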

"As you scale up your network, the loss surface looks more like a bowl. Local minima become less and less of an issue; it only really happens with small networks." - Andrej Karpathy

Adam is the default choice. See the original paper: https://arxiv.org/abs/1412.6980v8

Use exponential decay for the learning rate.
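
Exponential decay just means multiplying the learning rate by a constant factor each epoch. A minimal sketch (the initial rate and the 0.95 decay factor are arbitrary illustrations):

```python
initial_lr = 0.1
decay = 0.95

def lr_at_epoch(epoch):
    return initial_lr * decay ** epoch  # the rate shrinks geometrically

print(lr_at_epoch(0))   # 0.1
print(lr_at_epoch(10))  # roughly 0.06
```

Large steps early on make fast progress; shrinking steps later let the iterate settle into a minimum instead of bouncing around it.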