Stochastic Gradient Descent (SGD)
Stochastic gradient descent estimates the gradient from a random sample of the data rather than the full dataset.
SGD
- Randomly select a subset (minibatch) of the training data.
- Compute the gradient on that subset only.
The noise in the updates helps avoid getting stuck in local minima.
Strictly, SGD (batch size 1) is a special case of mini-batch gradient descent; in practice the name covers any small batch size. A minimal sketch of the loop is below.
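A minimal numpy sketch of that loop; the `grad_loss` callback and the hyperparameter defaults are illustrative assumptions, not from the source notes:

```python
import numpy as np

def sgd_epoch(params, X, y, grad_loss, lr=0.01, batch_size=32):
    """One epoch of mini-batch SGD. grad_loss(params, X_b, y_b) -> gradient."""
    n = X.shape[0]
    order = np.random.permutation(n)               # shuffle once per epoch
    for start in range(0, n, batch_size):
        batch = order[start:start + batch_size]
        g = grad_loss(params, X[batch], y[batch])  # gradient from the subset only
        params = params - lr * g                   # vanilla SGD update
    return params
```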
CS294
SGD minimizes expectations: for a function $f(\theta, x)$ differentiable in $\theta$, SGD solves
$$\min_\theta \; \mathbb{E}_{x}\left[f(\theta, x)\right]$$
by repeatedly sampling $x_i$ and updating $\theta \leftarrow \theta - \eta \nabla_\theta f(\theta, x_i)$. We can use this for Maximum Likelihood Estimation by taking $f(\theta, x) = -\log p_\theta(x)$.
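A toy instance of this (my example, not from the course): for a Gaussian with unit variance, the per-sample negative log-likelihood is $\frac{1}{2}(x - \theta)^2$ plus a constant, so SGD on it recovers the maximum-likelihood estimate of the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=10_000)  # true mean = 3.0

# Per-sample NLL: f(theta, x) = 0.5*(x - theta)**2 + const,
# so the per-sample gradient is d f / d theta = (theta - x).
theta, lr = 0.0, 0.01
for x in rng.permutation(data):
    theta -= lr * (theta - x)   # one SGD step per sample

print(theta)  # close to data.mean(), the MLE
```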
Three problems with vanilla SGD (CS231n)
- Poor conditioning: when the loss changes quickly along one direction and slowly along another (a high-condition-number Hessian), SGD jitters along the steep axis and crawls along the shallow one. The condition number is the ratio of the largest to smallest singular value of the Hessian (see the sketch after this list).
- Local minima and saddle points: at a saddle point the gradient is zero, so vanilla SGD stalls. Saddle points are far more common than local minima in high dimensions; a critical point in $\mathbb{R}^n$ is a local minimum only if all $n$ Hessian eigenvalues are positive, which becomes exponentially unlikely as $n$ grows. (Dauphin et al. 2014)
- Noisy gradients: minibatches give a stochastic estimate of the true gradient, so trajectories meander even on convex losses.
All three are addressed by adding velocity (momentum) and/or per-parameter learning rates (RMSProp, Adam); see Gradient Descent for the optimizer family with code. A sketch of both fixes follows.
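A hedged sketch of both fixes on the same assumed quadratic as above; all hyperparameters are illustrative:

```python
import numpy as np

H = np.diag([1.0, 50.0])
grad = lambda w: H @ w

# Momentum: accumulate a velocity so oscillations along the steep axis
# cancel while progress along the shallow axis compounds.
w, v = np.array([50.0, 1.0]), np.zeros(2)
lr, rho = 0.01, 0.9
for _ in range(1000):
    v = rho * v - lr * grad(w)
    w = w + v
print("momentum:", w)   # essentially at the optimum (the origin)

# RMSProp: divide by a running RMS of past gradients, giving each
# parameter its own effective learning rate.
w, s = np.array([50.0, 1.0]), np.zeros(2)
lr, decay, eps = 0.1, 0.9, 1e-8
for _ in range(1000):
    g = grad(w)
    s = decay * s + (1 - decay) * g ** 2
    w = w - lr * g / (np.sqrt(s) + eps)
print("rmsprop:", w)    # hovers near the origin, within ~lr of it
```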
Source
CS231n Lec 3 slides 52–61 (SGD update rule, three problems with vanilla SGD).