Backpropagation

Backpropagation computes the gradient of a scalar output with respect to every variable by recursive application of the chain rule. It is at the core of every modern deep learning model.

Intuition

Chain rule applied top-down and cached. Picture error signals flowing backward through the computation graph; at every node, the signal gets multiplied by the local derivative and passed further back. Without caching you’d redo exponential work recomputing upstream gradients for every parameter; with caching, the entire backward pass costs roughly 2x the forward pass no matter how many parameters you have. That’s the whole trick that makes training a billion-parameter net feasible.

How it fits into training

After a backward pass computes all gradients, we update the weights via Gradient Descent. Real implementations batch computations into tensors to exploit hardware parallelism.
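A minimal sketch of that loop on a toy quadratic loss (plain Python, with the gradient written out by hand in place of an actual backward pass):

```python
# Toy example: minimize L(w) = (w - 3)^2 with plain gradient descent.
# In a real model, backprop would produce grad(w); here we hand-derive it.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):                 # dL/dw = 2 * (w - 3), the "backward pass" by hand
    return 2.0 * (w - 3.0)

w = 0.0                      # initial weight
lr = 0.1                     # learning rate
for _ in range(100):
    w -= lr * grad(w)        # the gradient-descent update step

print(w)                     # converges toward the minimum at w = 3
```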

From calculus, the gradient is the vector of partial derivatives. In ML, β€œthe gradient on x” means the partial derivative of the loss with respect to x.

The gradient flows by multiplication through the Chain Rule. A simple example:

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(5.0, label='c')
e = a*b; e.label = 'e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = f * d; L.label='L'
 
 
# Manual backprop: node.grad = upstream gradient * local derivative
L.grad = 1.0                # dL/dL = 1
f.grad = L.grad * d.data    # L = f * d, so dL/df = d
d.grad = L.grad * f.data    # dL/dd = f
e.grad = d.grad * 1.0       # d = e + c: addition passes the gradient through
 
# Automatically (not magic -- the Value class does the same thing under the hood)
L.backward()

Above, the gradient of e (the derivative of L with respect to e) was calculated as dL/de = dL/dd * dd/de, where dd/de = 1 (since d = e + c) and dL/dd = f, which was calculated in the layer above.

Addition is a gradient distributor, so it just flows through.

Backpropagation starts from the back because front layers depend on back layers via the Chain Rule. After it finishes, we update weights through Gradient Descent.

Accumulate with +=, not =

Overwriting with = gives the wrong value when a variable is used in multiple places. Accumulation is needed for cases like:

b = a + a
b.backward()
 
a.grad # WRONG: 1.0 if backward overwrites with = (the second use of a clobbers the first)

a.grad # RIGHT: 2.0, because a contributes to b twice
# You get that, since under the hood __add__ runs:
def __add__(self, other):
	out = Value(self.data + other.data, (self, other), '+')

	def _backward():  # a closure: it captures self, other, and out
		self.grad += 1.0 * out.grad   # IMPORTANT: use +=, NOT =
		other.grad += 1.0 * out.grad
	out._backward = _backward
	return out
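A runnable sketch of this (a trimmed-down Value modeled on micrograd, not the full class; backward() here is triggered by hand rather than by topological sort):

```python
# Minimal Value class showing why += matters when a node is reused.
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad   # += accumulates both uses of a node
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

a = Value(3.0)
b = a + a          # a appears twice in the graph
b.grad = 1.0
b._backward()
print(a.grad)      # 2.0 with +=; plain = would have left it at 1.0
```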

Computational Graphs

Computational graphs let you pick the abstraction level for each operation. You can treat tanh as a single op or decompose it: the only requirement is defining how the local gradient is computed.
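A small numeric sketch of that freedom of abstraction: treat tanh as one gate with the closed-form local gradient 1 - tanh(x)^2, and confirm it against the decomposed computation via a finite difference (hypothetical numbers, stdlib only):

```python
import math

# tanh as a single gate: backprop only needs the local gradient,
# d(tanh x)/dx = 1 - tanh(x)^2.
x = 0.5
t = math.tanh(x)
local_grad = 1.0 - t * t          # closed-form local derivative

# If tanh were instead decomposed into exp/add/div gates, the chained
# local gradients would multiply out to the same number; check numerically:
h = 1e-6
numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
print(abs(local_grad - numeric) < 1e-8)  # both abstraction levels agree
```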

Example (vectorized, same idea)

Given input x and layer matrices W1, W2 (with W2 on the left): f = W2 max(0, W1 x).

To get the loss with respect to W1, we first need the loss with respect to the hidden vector h = W1 x; the chain rule then gives dL/dW1 = (dL/dh)(dh/dW1).

Backpropagation is a local process: each gate only cares about its local gradient, multiplies it by the cached upstream gradient, and propagates the result backward. Each gate effectively signals whether its output should increase or decrease to lower the loss.

Vector / matrix backprop

From the CS231n Lec 4 slides. When inputs and outputs are vectors or tensors instead of scalars, the local derivative becomes a Jacobian matrix rather than a scalar:

input dims | output dims | derivative type    | shape
scalar     | scalar      | regular derivative | scalar
vector (n) | scalar      | gradient           | n
vector (n) | vector (m)  | Jacobian           | m x n
The chain rule generalizes: downstream gradient = (transposed local Jacobian) Γ— upstream gradient. The loss stays scalar, so dL/dx always has the same shape as x. This is the most useful sanity check when implementing backward.

Implicit Jacobians (don’t form them)

Forming the Jacobian explicitly is usually disastrous. For a fully-connected layer with 4096-dimensional input and output and a minibatch of 64, the Jacobian would have (64Β·4096) Γ— (64Β·4096) entries, roughly 256 GB in float32. You always work with the Jacobian implicitly, via a closed-form rule for the vector–Jacobian product (VJP).

ReLU example. y = max(0, x) elementwise. The Jacobian is diagonal with 1’s where x > 0, 0’s elsewhere: sparse and never instantiated. Backward is just dL/dx = (dL/dy) βŠ™ 1[x > 0], i.e. copy the upstream gradient and zero it where x ≀ 0.
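In NumPy this is a one-liner (illustrative values; the mask plays the role of the diagonal Jacobian):

```python
import numpy as np

# ReLU backward as a masked pass-through; the diagonal Jacobian is never built.
x = np.array([1.0, -2.0, 3.0, -0.5])
upstream = np.array([0.1, 0.2, 0.3, 0.4])   # dL/dy from the layer above
dx = upstream * (x > 0)                     # implicit Jacobian-vector product
print(dx)                                   # gradient flows only where x > 0
```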

Matrix multiply Y = X W with X of shape N Γ— D, W of shape D Γ— M, Y of shape N Γ— M:

dL/dX = (dL/dY) Wᡀ    dL/dW = Xᡀ (dL/dY)

These formulas are easy to remember because they’re the only way to make the shapes match up. (Karpathy: β€œif you can’t remember the formula, just multiply things and transpose until shapes line up.”)
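A quick NumPy check of both the shape rule and the formulas themselves, comparing one entry of dW against a finite difference (shapes and seed chosen arbitrarily for illustration):

```python
import numpy as np

# Check dX = dY @ W.T and dW = X.T @ dY, plus the shape sanity check.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3))
W = rng.standard_normal((3, 4))
dY = rng.standard_normal((2, 4))   # pretend upstream gradient dL/dY

dX = dY @ W.T                      # (2,4) @ (4,3) -> (2,3): same shape as X
dW = X.T @ dY                      # (3,2) @ (2,4) -> (3,4): same shape as W

# Numeric check on one entry of W, using the surrogate loss L = sum(Y * dY)
h = 1e-6
Wp = W.copy()
Wp[1, 2] += h
numeric = (np.sum((X @ Wp) * dY) - np.sum((X @ W) * dY)) / h
print(abs(numeric - dW[1, 2]) < 1e-4)   # analytic and numeric gradients agree
```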

Each gate type has a recognizable backward signature (add = distributor, mul = swap multiplier, copy = adder, max = router). See Computational Graph for the full table.

Why those signatures

Add distributes because , so the upstream gradient passes straight through. Mul swaps because , so each input gets the other input times the upstream. Max is a router because only the argmax input actually influenced the output; the rest contributed nothing, gradient zero. Every backward rule is just β€œwhat would a tiny wiggle of this input do to the output?”
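The three signatures on a pair of example scalars (illustrative numbers; `up` stands for the cached upstream gradient dL/dout):

```python
# Gate signatures: add distributes, mul swaps, max routes.
a, b = 2.0, 5.0
up = 0.7                      # upstream gradient dL/dout

# add: out = a + b  ->  da = up,     db = up      (distributor)
da_add, db_add = up, up
# mul: out = a * b  ->  da = up * b, db = up * a  (swap multiplier)
da_mul, db_mul = up * b, up * a
# max: out = max(a, b) -> gradient routed to the winning input only
da_max, db_max = (up, 0.0) if a > b else (0.0, up)

print(da_mul, db_mul, da_max, db_max)
```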