Computational Graph
A computational graph is a DAG where nodes are operations (gates) and edges carry intermediate values. It’s the data structure that makes Backpropagation modular: any expression — no matter how complex — can be decomposed into a graph of small differentiable operations, and gradients flow backward by recursive application of the Chain Rule.
This is what every modern autograd framework (PyTorch, JAX, TensorFlow) is built on.
Why we need it
The naïve alternative — deriving gradients by hand — fails for three reasons (CS231n Lec 4):
- Tedious: lots of matrix calculus, lots of paper
- Brittle: change the loss (e.g. softmax → hinge) and you re-derive everything from scratch
- Infeasible: complex models like a CNN or a Neural Turing Machine have computation graphs with hundreds of nodes — no human writes that derivative out
A graph + backprop solves all three: write the forward pass once, get gradients for free.
Simple example:
Take f(x, y, z) = (x + y)·z. Decompose into two gates: q = x + y, then f = q·z.
With x = −2, y = 5, z = −4: forward gives q = 3, f = −12.
Backward (set ∂f/∂f = 1):
- Multiply gate: ∂f/∂q = z = −4, ∂f/∂z = q = 3
- Add gate: ∂f/∂x = ∂f/∂q · 1 = −4, ∂f/∂y = ∂f/∂q · 1 = −4
Each gate only needs to know its local gradient (e.g. ∂q/∂x = 1). The chain rule multiplies in the upstream gradient (∂f/∂q) flowing back from later nodes.
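The two-gate walkthrough can be written out directly in a few lines of plain Python (a minimal sketch; the inputs x = −2, y = 5, z = −4 follow the standard CS231n example and are assumed here):

```python
# Hand-rolled forward/backward for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0

# Forward
q = x + y              # add gate
f = q * z              # multiply gate

# Backward (set df/df = 1)
grad_f = 1.0
grad_q = z * grad_f    # mul gate: df/dq = z
grad_z = q * grad_f    # mul gate: df/dz = q
grad_x = 1.0 * grad_q  # add gate distributes: dq/dx = 1
grad_y = 1.0 * grad_q  # dq/dy = 1
```

Each line of the backward section multiplies a local gradient into the upstream gradient — the whole chain rule, one gate at a time.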
Patterns in gradient flow
Each gate type has a characteristic backward behavior — useful for sanity-checking gradients:
- add gate → gradient distributor: copies the upstream gradient unchanged to all inputs
- mul gate → swap multiplier: for z = x·y, ∂L/∂x = y · ∂L/∂z and ∂L/∂y = x · ∂L/∂z (swaps the other input in)
- copy gate (a value used twice) → gradient adder: sum the upstream gradients from each branch
- max gate → gradient router: passes the full upstream gradient to whichever input was the max; sends 0 to the others
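The four patterns can be sanity-checked numerically in a few lines (a sketch; the concrete values are chosen here for illustration):

```python
up = 2.0  # some upstream gradient dL/dz

# add gate z = a + b → distributor: both inputs get the upstream unchanged
grad_a_add, grad_b_add = up, up

# mul gate z = a * b → swap multiplier: each input gets the *other* input times upstream
a, b = 3.0, 5.0
grad_a_mul, grad_b_mul = b * up, a * up

# max gate z = max(a, b) → router: full upstream to the winner, 0 to the rest
grad_a_max = up if a > b else 0.0
grad_b_max = up if b >= a else 0.0

# copy gate (a feeds two branches) → adder: sum the upstream from each branch
up1, up2 = 2.0, 7.0
grad_a_copy = up1 + up2
```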
Modular forward / backward API
Real frameworks expose each gate as a class with a forward and backward method. PyTorch’s torch.autograd.Function:
```python
import torch

class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)  # cache values needed in backward
        return x * y

    @staticmethod
    def backward(ctx, grad_z):       # grad_z is the upstream dL/dz
        x, y = ctx.saved_tensors
        grad_x = y * grad_z          # local dz/dx = y, times upstream
        grad_y = x * grad_z
        return grad_x, grad_y
```

The forward pass must save any intermediate values the backward pass will need (here, both x and y). This is why training memory grows with depth — every layer’s activations stick around until backward consumes them. (See Activation Checkpointing for trading compute to reduce this.)
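The same forward/backward contract can be sketched framework-free, with a plain Python class standing in for torch.autograd.Function (the class and attribute names here are ours, not a real API):

```python
class MultiplyGate:
    """Minimal sketch of a gate with a forward/backward API."""

    def forward(self, x, y):
        self.x, self.y = x, y      # cache values needed in backward
        return x * y

    def backward(self, grad_z):    # grad_z is the upstream dL/dz
        grad_x = self.y * grad_z   # local dz/dx = y, times upstream
        grad_y = self.x * grad_z
        return grad_x, grad_y

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)
gx, gy = gate.backward(1.0)
```

A full autograd engine is essentially a collection of such gates plus bookkeeping that records which gate produced which value, so backward can be called in reverse topological order.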
“Flat” backprop code
For a fixed graph you can write backprop inline, in reverse order of the forward pass. Sigmoid example f(w, x) = σ(w0·x0 + w1·x1 + w2), where σ(s) = 1 / (1 + e^(−s)):
```python
# Forward
s0 = w0 * x0
s1 = w1 * x1
s2 = s0 + s1
s3 = s2 + w2
L = sigmoid(s3)

# Backward (reverse order)
grad_L = 1.0
grad_s3 = grad_L * (1 - L) * L  # sigmoid local grad
grad_w2 = grad_s3               # add gate distributes
grad_s2 = grad_s3
grad_s0 = grad_s2               # add gate distributes
grad_s1 = grad_s2
grad_w1 = grad_s1 * x1          # mul gate swaps
grad_x1 = grad_s1 * w1
grad_w0 = grad_s0 * x0
grad_x0 = grad_s0 * w0
```

The structure mirrors the forward pass exactly — that’s the property autograd exploits.
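The flat code becomes runnable once sigmoid is defined, and the analytic gradient can be verified against a centered finite-difference estimate (a sketch; the input values follow the CS231n slide example and are assumed here):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def f(w0, x0, w1, x1, w2):
    return sigmoid(w0 * x0 + w1 * x1 + w2)

# Input values from the CS231n sigmoid example (assumed)
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward: s3 = (2)(-1) + (-3)(-2) + (-3) = 1, L = sigmoid(1)
L = f(w0, x0, w1, x1, w2)

# Backward for one parameter, exactly as in the flat code
grad_s3 = (1 - L) * L
grad_w0 = grad_s3 * x0

# Centered finite-difference check of dL/dw0
h = 1e-5
numeric = (f(w0 + h, x0, w1, x1, w2) - f(w0 - h, x0, w1, x1, w2)) / (2 * h)
```

Gradient checking like this is the standard way to catch a wrong local gradient in a hand-written backward pass.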
Source
CS231n Lec 4 slides 54–116 (computational graphs, simple example, gate patterns, sigmoid example, flat code, modular forward/backward API).