Stochastic Gradient Descent (SGD)

Stochastic gradient descent estimates the gradient from a random sample of the training data instead of computing it over the full dataset.

SGD

  • Randomly select a subset (minibatch) of training data at each step.
  • The gradient is computed only on that subset — a noisy but cheap estimate of the full-batch gradient.

The noise in the gradient estimate helps avoid getting stuck in poor local minima.

Stochastic gradient descent (one sample per step) is the special case of Mini-Batch Gradient Descent with batch size 1.
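A minimal sketch of the idea, using an assumed toy setup (1-D least squares, fitting w in y = w·x): the same loop does mini-batch gradient descent for any batch size, and plain SGD when the batch size is 1.

```python
import random

def minibatch_sgd(xs, ys, batch_size=1, lr=0.01, steps=1000, seed=0):
    """Fit w in y = w * x by mini-batch SGD; batch_size=1 is plain SGD."""
    rng = random.Random(seed)
    w = 0.0
    idx = list(range(len(xs)))
    for _ in range(steps):
        batch = rng.sample(idx, batch_size)
        # Gradient of mean squared error over the sampled batch only.
        grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]       # generated with w = 3
w = minibatch_sgd(xs, ys, batch_size=1)   # SGD: one sample per step
```

The function names and data here are illustrative, not from the lectures; the point is only that SGD and mini-batch GD differ in how many indices are sampled per step.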

CS294

SGD minimizes expectations: for a differentiable function $f(\theta; x)$ of parameters $\theta$, SGD solves $\min_\theta \, \mathbb{E}_x\!\left[f(\theta; x)\right]$ via the update $\theta \leftarrow \theta - \eta \, \nabla_\theta f(\theta; x_i)$ on a sampled $x_i$. We can use this with Maximum Likelihood Estimation, since maximizing the log-likelihood over the data is minimizing $\mathbb{E}_x\!\left[-\log p_\theta(x)\right]$.
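A small sketch of MLE via SGD under an assumed example (not from the lecture): estimating the mean of a Gaussian with known unit variance. The per-sample negative log-likelihood is 0.5·(x − μ)² plus a constant, so its gradient with respect to μ is (μ − x), and the SGD iterate drifts toward the sample mean.

```python
import random

def mle_mean_sgd(data, lr=0.05, steps=2000, seed=0):
    """MLE for a Gaussian mean (unit variance) by SGD on the NLL."""
    rng = random.Random(seed)
    mu = 0.0
    for _ in range(steps):
        x = rng.choice(data)      # sample one data point
        mu -= lr * (mu - x)       # SGD step on 0.5 * (x - mu)**2
    return mu

data = [1.8, 2.1, 2.0, 1.9, 2.2]
mu_hat = mle_mean_sgd(data)       # approaches the sample mean (2.0)
```

With a fixed learning rate the iterate hovers near the MLE rather than converging exactly; a decaying step size would remove that residual noise.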

Three problems with vanilla SGD (CS231n)

  1. Poor conditioning — when the loss changes quickly along one direction and slowly along another (high-condition-number Hessian), SGD jitters along the steep axis and crawls along the shallow one. Condition number = ratio of largest to smallest singular value of the Hessian.

  2. Local minima and saddle points — at saddle points the gradient is zero, so vanilla SGD stalls. Saddle points are much more common than local minima in high dimensions — a critical point in N dimensions is a local min only if all N Hessian eigenvalues are positive, which becomes exponentially unlikely as N grows. (Dauphin et al. 2014)

  3. Noisy gradients — minibatches give a stochastic estimate of the true gradient. Trajectories meander even on convex losses.

All three are addressed by adding velocity (momentum) and/or per-parameter learning rates (RMSProp, Adam) — see Gradient Descent for the optimizer family with code.
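As a minimal sketch of the velocity idea (the full optimizer family lives in the Gradient Descent note), the momentum update below runs on an assumed toy quadratic loss f(w) = 0.5·w² with gradient w; the names are illustrative.

```python
def sgd_momentum(grad_fn, w0, lr=0.1, rho=0.9, steps=100):
    """SGD with momentum: accumulate a velocity, step along it."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = rho * v - lr * grad_fn(w)   # decay old velocity, add new gradient
        w = w + v                        # parameter moves along the velocity
    return w

# Toy quadratic: f(w) = 0.5 * w**2, so grad(w) = w.
w_final = sgd_momentum(lambda w: w, w0=5.0)
```

The velocity averages gradients over time, which damps the jitter along steep directions (problem 1), carries the iterate through zero-gradient saddle regions (problem 2), and smooths minibatch noise (problem 3).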

Source

CS231n Lec 3 slides 52–61 (SGD update rule, three problems with vanilla SGD).