Computational Graph

A computational graph is a DAG where nodes are operations (gates) and edges carry intermediate values. It’s the data structure that makes Backpropagation modular: any expression — no matter how complex — can be decomposed into a graph of small differentiable operations, and gradients flow backward by recursive application of the Chain Rule.

This is what every modern autograd framework (PyTorch, JAX, TensorFlow) is built on.

Why we need it

The naïve alternative — derive by hand — fails for three reasons (CS231n Lec 4):

  1. Tedious: lots of matrix calculus, lots of paper
  2. Brittle: change the loss (e.g. softmax → hinge) and you re-derive everything from scratch
  3. Infeasible: complex models like a CNN or a Neural Turing Machine have computation graphs with hundreds of nodes — no human writes that derivative out

A graph + backprop solves all three: write the forward pass once, get gradients for free.
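A quick PyTorch illustration of point 2 (a minimal sketch; the shapes and the hinge formulation here are placeholders): swapping the loss rewrites one forward line, and backward still works with no re-derivation.

import torch

W = torch.randn(3, 5, requires_grad=True)   # tiny linear classifier, 5 classes
x = torch.randn(3)                          # one input
y = torch.tensor([2])                       # its correct class

# Softmax (cross-entropy) loss
loss = torch.nn.functional.cross_entropy((x @ W).unsqueeze(0), y)
loss.backward()                             # dloss/dW with no hand derivation

# Swap softmax → hinge: only the forward changes, backward is still free
W.grad = None
scores = x @ W
margins = torch.clamp(scores - scores[y] + 1.0, min=0.0)
loss = margins.sum() - 1.0                  # drop the correct class's own margin
loss.backward()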

Simple example: f(x, y, z) = (x + y) z

Decompose into two gates: q = x + y, then f = q · z.

With x = -2, y = 5, z = -4: forward gives q = 3, f = -12.

Backward (set ∂f/∂f = 1):

  • Multiply gate: ∂f/∂q = z = -4, ∂f/∂z = q = 3
  • Add gate: ∂f/∂x = ∂f/∂q · ∂q/∂x = -4 · 1 = -4, and likewise ∂f/∂y = -4

Each gate only needs to know its local gradient (∂q/∂x, ∂q/∂y). The chain rule multiplies in the upstream gradient (∂f/∂q) flowing back from later nodes.
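The same numbers in code (a minimal sketch in plain Python, mirroring the hand computation above):

# Forward: f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y             # add gate:      q = 3
f = q * z             # multiply gate: f = -12

# Backward: chain rule from the output, df/df = 1
grad_f = 1.0
grad_q = grad_f * z   # multiply gate: df/dq = z = -4
grad_z = grad_f * q   # multiply gate: df/dz = q = 3
grad_x = grad_q * 1   # add gate: local dq/dx = 1, upstream passes through => -4
grad_y = grad_q * 1   # same for y => -4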

Patterns in gradient flow

Each gate type has a characteristic backward behavior — useful for sanity-checking gradients (verified numerically in the sketch after this list):

  • add gate → gradient distributor: copies the upstream gradient unchanged to all inputs
  • mul gate → swap multiplier: for z = x · y, ∂z/∂x = y and ∂z/∂y = x (swaps the other input in)
  • copy gate (a value used twice) → gradient adder: sum the upstream gradients from each branch
  • max gate → gradient router: passes the full upstream gradient to whichever input was the max; sends 0 to the others
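All four patterns in one expression, checked against PyTorch's autograd (the values 3 and -4 are arbitrary):

import torch

a = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(-4.0, requires_grad=True)

# a and b are each used three times, so they are also copy gates:
# their gradients sum over the add, max, and mul branches.
out = (a + b) + torch.maximum(a, b) + a * b
out.backward()

print(a.grad)  # 1 (add) + 1 (max routes to a, the max) + b (mul swap) = -2
print(b.grad)  # 1 (add) + 0 (max)                      + a (mul swap) =  4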

Modular forward / backward API

Real frameworks expose each gate as a class with forward and backward methods. In PyTorch you subclass torch.autograd.Function:

import torch

class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)  # cache values needed in backward
        return x * y

    @staticmethod
    def backward(ctx, grad_z):  # grad_z is the upstream dL/dz
        x, y = ctx.saved_tensors
        grad_x = y * grad_z   # local dz/dx = y, times upstream
        grad_y = x * grad_z
        return grad_x, grad_y
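Custom Functions are invoked via .apply. torch.autograd.gradcheck compares the hand-written backward against numerical differentiation (it expects double precision):

x = torch.randn(4, dtype=torch.float64, requires_grad=True)
y = torch.randn(4, dtype=torch.float64, requires_grad=True)

z = Multiply.apply(x, y)       # forward; records Multiply in the graph
z.sum().backward()             # upstream grad_z is all ones
print(torch.equal(x.grad, y))  # True: dL/dx = y * 1

print(torch.autograd.gradcheck(Multiply.apply, (x, y)))  # True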

The forward pass must save any intermediate values the backward pass will need (here, both x and y). This is why training memory grows with depth — every layer’s activations stick around until backward consumes them. (See Activation Checkpointing for trading compute to reduce this.)
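For reference, PyTorch exposes this trade-off via torch.utils.checkpoint: the wrapped segment's activations are dropped after the forward pass and recomputed during backward. A minimal sketch (the layer sizes are arbitrary):

import torch
from torch.utils.checkpoint import checkpoint

segment = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU())
x = torch.randn(32, 128, requires_grad=True)

out = checkpoint(segment, x, use_reentrant=False)  # activations not cached
out.sum().backward()                               # segment's forward re-runs here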

“Flat” backprop code

For a fixed graph you can write backprop inline, in reverse order of the forward pass. Sigmoid example, L = σ(w0·x0 + w1·x1 + w2) with σ(s) = 1 / (1 + e^(-s)):

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Example values from the CS231n slides
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# Forward
s0 = w0 * x0
s1 = w1 * x1
s2 = s0 + s1
s3 = s2 + w2
L  = sigmoid(s3)                        # 0.731

# Backward (reverse order)
grad_L  = 1.0
grad_s3 = grad_L * (1 - L) * L          # sigmoid local grad, ~0.196
grad_w2 = grad_s3                       # add gate distributes
grad_s2 = grad_s3
grad_s0 = grad_s2                       # add gate distributes
grad_s1 = grad_s2
grad_w1 = grad_s1 * x1                  # mul gate swaps
grad_x1 = grad_s1 * w1
grad_w0 = grad_s0 * x0
grad_x0 = grad_s0 * w0

The structure mirrors the forward pass exactly — that’s the property autograd exploits.
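A central-difference check of one of these gradients (a sketch; the step size and tolerance are arbitrary):

h = 1e-6

def loss_at(w0_val):
    return sigmoid(w0_val * x0 + w1 * x1 + w2)

numeric = (loss_at(w0 + h) - loss_at(w0 - h)) / (2 * h)
print(abs(numeric - grad_w0) < 1e-8)  # True: analytic matches numeric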

Source

CS231n Lec 4 slides 54–116 (computational graphs, simple example, gate patterns, sigmoid example, flat code, modular forward/backward API).