Backpropagation
Backpropagation computes gradients of a scalar output with respect to every variable by recursive application of the chain rule. It is at the core of any modern Deep Learning model.
Intuition
Chain rule applied top-down and cached. Picture error signals flowing backward through the computation graph; at every node, the signal gets multiplied by the local derivative and passed further back. Without caching you'd redo exponential work recomputing upstream gradients for every parameter; with caching, the entire backward pass costs roughly 2x the forward pass no matter how many parameters you have. That's the whole trick that makes training a billion-parameter net feasible.
How it fits into training
After a backward pass computes all gradients, we update the weights via Gradient Descent. Real implementations batch computations into tensors to exploit hardware parallelism.
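A rough sketch of that loop, reusing the Value class from these notes (the learning rate, the number of steps, and the toy loss $(w - 2)^2$ are made up for illustration; it assumes Value supports +, *, and backward() as used below):

w = Value(3.0, label='w')
lr = 0.1                                          # made-up learning rate
for step in range(20):
    loss = (w + Value(-2.0)) * (w + Value(-2.0))  # toy loss (w - 2)^2, minimized at w = 2
    w.grad = 0.0                                  # clear the stale gradient from the last step
    loss.backward()                               # backward pass fills in w.grad
    w.data -= lr * w.grad                         # gradient descent: step against the gradient
# w.data is now close to 2.0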
From Calculus, the Gradient is the vector of partial derivatives. In ML, "the gradient on $x$" means the partial derivative of the loss with respect to $x$, i.e. $\frac{\partial L}{\partial x}$.
The gradient flows by multiplication through the Chain Rule. A simple example:
# Value is the scalar autograd class built up in Karpathy's micrograd walkthrough
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(5.0, label='c')
e = a*b; e.label = 'e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = f * d; L.label='L'
# Manually calculating:
# node.grad = upstream gradient * local gradient with respect to that node
L.grad = 1.0              # dL/dL = 1
f.grad = L.grad * d.data  # dL/df = d  (= -1.0)
d.grad = L.grad * f.data  # dL/dd = f  (= -2.0)
e.grad = d.grad           # addition: local gradient is 1, so the upstream passes through
c.grad = d.grad
a.grad = e.grad * b.data  # dL/da = dL/de * b  (= 6.0)
b.grad = e.grad * a.data  # dL/db = dL/de * a  (= -4.0)
# Automatically (*magic*, no I actually understand this code-ish now)
L.backward()

Above, the gradient of e (the derivative of L with respect to e) was calculated as $\frac{\partial L}{\partial e} = \frac{\partial L}{\partial d} \cdot \frac{\partial d}{\partial e}$, where $\frac{\partial d}{\partial e} = 1$ (since $d = e + c$) and $\frac{\partial L}{\partial d}$ is the upstream gradient computed in the layer above.
Addition is a gradient distributor, so it just flows through.
Backpropagation starts from the back because front layers depend on back layers via the Chain Rule. After it finishes, we update weights through Gradient Descent.
Accumulate with +=, not =
Overwriting with = gives the wrong value when a variable is used in multiple places. Accumulation is needed for cases like:
a = Value(3.0, label='a')
b = a + a
b.backward()
a.grad  # 1.0 if implemented IMPROPERLY: the second use of a overwrites the first
# Answer:
a.grad  # 2.0: both uses of a contribute, and their gradients accumulate
# You get 2.0 because under the hood addition calls:
def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
    def _backward():  # yes, this is a closure: it captures self, other, and out
        self.grad += 1.0 * out.grad   # IMPORTANT: use +=, NOT =
        other.grad += 1.0 * out.grad
    out._backward = _backward
    return out

Computational Graphs
Computational graphs let you pick the abstraction level for each operation. You can treat tanh as a single op or decompose it: the only requirement is defining how the local gradient is computed.
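For instance, here is a sketch of tanh as a single op, written as a method for the same Value class and mirroring the __add__ shown earlier; the only new ingredients are math.tanh for the forward value and the local gradient 1 - tanh(x)^2:

import math

def tanh(self):  # meant to live on the Value class, next to __add__
    t = math.tanh(self.data)
    out = Value(t, (self,), 'tanh')
    def _backward():
        # local gradient of tanh is 1 - tanh(x)^2; multiply by the upstream grad and accumulate
        self.grad += (1 - t**2) * out.grad
    out._backward = _backward
    return out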
Example (vectorized, same idea)
Given input $x$ and weight matrices $W_1$, $W_2$ (with $W_2$ on the left), the output is $s = W_2 W_1 x$ and the loss $L$ is a scalar function of $s$.
To get the gradient of the loss with respect to $W_1$, we first need the gradient of the loss with respect to the intermediate $h = W_1 x$: by the chain rule, $\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h}\,\frac{\partial h}{\partial W_1}$, where $\frac{\partial L}{\partial h}$ is itself computed by backpropagating from the layer above.
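A small NumPy sketch of that chain; the sizes and the squared-error loss are made up for illustration, and the point is that the gradient for $W_1$ reuses the cached gradient for $h$:

import numpy as np

x  = np.random.randn(4, 1)   # input column vector
W1 = np.random.randn(5, 4)
W2 = np.random.randn(3, 5)

h = W1 @ x                   # forward, cached
s = W2 @ h
L = 0.5 * np.sum(s**2)       # toy scalar loss

dL_ds  = s                   # dL/ds for this toy loss
dL_dh  = W2.T @ dL_ds        # backprop through the second layer
dL_dW2 = dL_ds @ h.T         # same shape as W2
dL_dW1 = dL_dh @ x.T         # same shape as W1, built from the cached dL/dh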
Backpropagation is a local process: each gate only cares about its local gradient, multiplies it by the cached upstream gradient, and propagates the result backward. Each gate effectively signals whether its output should increase or decrease to lower the loss.
- https://cs231n.github.io/optimization-2/
- To compute gradients in vectorized form, use the Jacobian Matrix
- Karpathy's article on why understanding backprop matters
Vector / matrix backprop
From the CS231n Lec 4 slides. When inputs and outputs are vectors or tensors instead of scalars, the local derivative becomes a Jacobian matrix rather than a scalar:
| input dims | output dims | derivative type | shape |
|---|---|---|---|
| scalar | scalar | regular derivative | scalar |
| vector $x \in \mathbb{R}^N$ | scalar | gradient | $N$ (same as input) |
| vector $x \in \mathbb{R}^N$ | vector $y \in \mathbb{R}^M$ | Jacobian | $M \times N$ |
The chain rule generalizes: downstream gradient = (local Jacobian, transposed in the convention above) times upstream gradient, $\frac{\partial L}{\partial x} = \left(\frac{\partial y}{\partial x}\right)^{\top} \frac{\partial L}{\partial y}$. The loss stays scalar, so $\frac{\partial L}{\partial x}$ always has the same shape as $x$. This is the most useful sanity check when implementing backward.
Implicit Jacobians (don't form them)
Forming the Jacobian explicitly is usually disastrous. For a fully-connected layer acting on a minibatch of activations, the Jacobian has (batch size x feature dim)^2 entries, which for realistic sizes runs to hundreds of GB. You always work with the Jacobian implicitly, via a closed-form rule for the vector-Jacobian product (VJP).
ReLU example. $y = \max(0, x)$ elementwise. The Jacobian is diagonal with 1's where $x_i > 0$ and 0's elsewhere: sparse and never instantiated. Backward is just $\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} \cdot \mathbb{1}[x_i > 0]$.
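In NumPy the whole implicit-Jacobian backward for ReLU collapses to one masked multiply (a sketch; x and the upstream grad_y are made-up placeholders):

import numpy as np

x = np.random.randn(6)
y = np.maximum(0, x)            # forward ReLU
grad_y = np.random.randn(6)     # pretend upstream gradient dL/dy
grad_x = grad_y * (x > 0)       # multiply by the diagonal Jacobian without ever building it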
Matrix multiply $Y = XW$ with $X \in \mathbb{R}^{N \times D}$, $W \in \mathbb{R}^{D \times M}$, $Y \in \mathbb{R}^{N \times M}$:
$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}\, W^{\top}, \qquad \frac{\partial L}{\partial W} = X^{\top}\, \frac{\partial L}{\partial Y}.$$
These formulas are easy to remember because they're the only way to make the shapes match up. (Karpathy: "if you can't remember the formula, just multiply things and transpose until shapes line up.")
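A quick NumPy check of those two formulas against a finite-difference estimate (the sizes, the checked entry, and the surrogate loss sum(Y * dY) are arbitrary choices; the two printed numbers should agree to several decimal places):

import numpy as np

N, D, M = 3, 4, 5
X = np.random.randn(N, D)
W = np.random.randn(D, M)

Y  = X @ W
dY = np.random.randn(N, M)      # pretend upstream gradient dL/dY
dX = dY @ W.T                   # analytic dL/dX
dW = X.T @ dY                   # analytic dL/dW

# finite-difference check of a single entry of dW, using L = sum(Y * dY) so dL/dY is exactly dY
eps = 1e-6
i, j = 1, 2
Wp = W.copy(); Wp[i, j] += eps
num = (np.sum((X @ Wp) * dY) - np.sum(Y * dY)) / eps
print(dW[i, j], num)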
Each gate type has a recognizable backward signature (add = distributor, mul = swap multiplier, copy = adder, max = router). See Computational Graph for the full table.
Why those signatures
Add distributes because $\frac{\partial (a+b)}{\partial a} = 1$, so the upstream gradient passes straight through. Mul swaps because $\frac{\partial (ab)}{\partial a} = b$, so each input gets the other input times the upstream. Max is a router because only the argmax input actually influenced the output; the rest contributed nothing, gradient zero. Every backward rule is just "what would a tiny wiggle of this input do to the output?"
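That "tiny wiggle" view can be checked numerically; this toy snippet (made-up values for a and b) nudges each input of a*b and max(a, b) and recovers exactly the swap-multiplier and router behaviour:

a, b, eps = 2.0, -3.0, 1e-6

# mul: wiggling a changes a*b at a rate of b, and vice versa
print(((a + eps) * b - a * b) / eps)         # ~ -3.0  (= b)
print((a * (b + eps) - a * b) / eps)         # ~  2.0  (= a)

# max: only the larger input gets gradient; the other is a dead end
print((max(a + eps, b) - max(a, b)) / eps)   # ~ 1.0  (a is the argmax, gradient routed to it)
print((max(a, b + eps) - max(a, b)) / eps)   # ~ 0.0  (b did not influence the output)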