Backpropagation
Backpropagation computes gradients of a scalar output with respect to every variable by recursive application of the chain rule. It is at the core of any modern Deep Learning model.
Intuition
Chain rule applied top-down and cached. Picture error signals flowing backward through the computation graph; at every node, the signal gets multiplied by the local derivative and passed further back. Without caching you'd redo exponential work recomputing upstream gradients for every parameter; with caching, the entire backward pass costs roughly 2x the forward pass no matter how many parameters you have. That's the whole trick that makes training a billion-parameter net feasible.
How it fits into training
After a backward pass computes all gradients, we update the weights via Gradient Descent. Real implementations batch computations into tensors to exploit hardware parallelism.
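A rough sketch of that loop, reusing the Value class from these notes (the learning rate, the number of steps, and the toy loss $(w - 2)^2$ are made up for illustration; it assumes Value supports +, *, and backward() as used below):

w = Value(3.0, label='w')
lr = 0.1                                          # made-up learning rate
for step in range(20):
    loss = (w + Value(-2.0)) * (w + Value(-2.0))  # toy loss (w - 2)^2, minimized at w = 2
    w.grad = 0.0                                  # clear the stale gradient from the last step
    loss.backward()                               # backward pass fills in w.grad
    w.data -= lr * w.grad                         # gradient descent: step against the gradient
# w.data is now close to 2.0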
From Calculus, the Gradient is the vector of partial derivatives. In ML, "the gradient on $x$" means the partial derivative of the loss with respect to $x$, i.e. $\frac{\partial L}{\partial x}$.
The gradient flows by multiplication through the Chain Rule. A simple example:
# Value is the scalar autograd class built up in Karpathy's micrograd walkthrough
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(5.0, label='c')
e = a*b; e.label = 'e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = f * d; L.label='L'
# Manually calculating:
# node.grad = upstream gradient * local gradient with respect to that node
L.grad = 1.0              # dL/dL = 1
f.grad = L.grad * d.data  # dL/df = d  (= -1.0)
d.grad = L.grad * f.data  # dL/dd = f  (= -2.0)
e.grad = d.grad           # addition: local gradient is 1, so the upstream passes through
c.grad = d.grad
a.grad = e.grad * b.data  # dL/da = dL/de * b  (= 6.0)
b.grad = e.grad * a.data  # dL/db = dL/de * a  (= -4.0)
# Automatically (*magic*, no I actually understand this code-ish now)
L.backward()

Above, the gradient of e (the derivative of L with respect to e) was calculated as $\frac{\partial L}{\partial e} = \frac{\partial L}{\partial d} \cdot \frac{\partial d}{\partial e}$, where $\frac{\partial d}{\partial e} = 1$ (since $d = e + c$) and $\frac{\partial L}{\partial d}$ is the upstream gradient computed in the layer above.
Addition is a gradient distributor, so it just flows through.
Backpropagation starts from the back because front layers depend on back layers via the Chain Rule. After it finishes, we update weights through Gradient Descent.
Accumulate with +=, not =
Overwriting with = gives the wrong value when a variable is used in multiple places. Accumulation is needed for cases like:
a = Value(3.0, label='a')
b = a + a
b.backward()
a.grad  # 1.0 if implemented IMPROPERLY: the second use of a overwrites the first
# Answer:
a.grad  # 2.0: both uses of a contribute, and their gradients accumulate
# You get 2.0 because under the hood addition calls:
def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
    def _backward():  # yes, this is a closure: it captures self, other, and out
        self.grad += 1.0 * out.grad   # IMPORTANT: use +=, NOT =
        other.grad += 1.0 * out.grad
    out._backward = _backward
    return out

Computational Graphs
Computational graphs let you pick the abstraction level for each operation. You can treat tanh as a single op or decompose it: the only requirement is defining how the local gradient is computed.
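For instance, here is a sketch of tanh as a single op, written as a method for the same Value class and mirroring the __add__ shown earlier; the only new ingredients are math.tanh for the forward value and the local gradient 1 - tanh(x)^2:

import math

def tanh(self):  # meant to live on the Value class, next to __add__
    t = math.tanh(self.data)
    out = Value(t, (self,), 'tanh')
    def _backward():
        # local gradient of tanh is 1 - tanh(x)^2; multiply by the upstream grad and accumulate
        self.grad += (1 - t**2) * out.grad
    out._backward = _backward
    return out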
Example (vectorized, same idea)
Given input $x$ and weight matrices $W_1$, $W_2$ (with $W_2$ on the left), the output is $s = W_2 W_1 x$ and the loss $L$ is a scalar function of $s$.
To get the gradient of the loss with respect to $W_1$, we first need the gradient of the loss with respect to the intermediate $h = W_1 x$: by the chain rule, $\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h}\,\frac{\partial h}{\partial W_1}$, where $\frac{\partial L}{\partial h}$ is itself computed by backpropagating from the layer above.
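A small NumPy sketch of that chain; the sizes and the squared-error loss are made up for illustration, and the point is that the gradient for $W_1$ reuses the cached gradient for $h$:

import numpy as np

x  = np.random.randn(4, 1)   # input column vector
W1 = np.random.randn(5, 4)
W2 = np.random.randn(3, 5)

h = W1 @ x                   # forward, cached
s = W2 @ h
L = 0.5 * np.sum(s**2)       # toy scalar loss

dL_ds  = s                   # dL/ds for this toy loss
dL_dh  = W2.T @ dL_ds        # backprop through the second layer
dL_dW2 = dL_ds @ h.T         # same shape as W2
dL_dW1 = dL_dh @ x.T         # same shape as W1, built from the cached dL/dh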
Backpropagation is a local process: each gate only cares about its local gradient, multiplies it by the cached upstream gradient, and propagates the result backward. Each gate effectively signals whether its output should increase or decrease to lower the loss.
- https://cs231n.github.io/optimization-2/
- To compute gradients in vectorized form, use the Jacobian Matrix
- Karpathy's article on why understanding backprop matters
Vector / matrix backprop
From the CS231n Lec 4 slides. When inputs and outputs are vectors or tensors instead of scalars, the local derivative becomes a Jacobian matrix rather than a scalar:
| input dims | output dims | derivative type | shape |
|---|---|---|---|
| scalar | scalar | regular derivative | scalar |
| vector $x \in \mathbb{R}^N$ | scalar | gradient | $N$ (same as input) |
| vector $x \in \mathbb{R}^N$ | vector $y \in \mathbb{R}^M$ | Jacobian | $M \times N$ |
The chain rule generalizes: downstream gradient = (local Jacobian, transposed in the convention above) times upstream gradient, $\frac{\partial L}{\partial x} = \left(\frac{\partial y}{\partial x}\right)^{\top} \frac{\partial L}{\partial y}$. The loss stays scalar, so $\frac{\partial L}{\partial x}$ always has the same shape as $x$. This is the most useful sanity check when implementing backward.
Implicit Jacobians (don't form them)
Forming the Jacobian explicitly is usually disastrous. For a fully-connected layer acting on a minibatch of activations, the Jacobian has (batch size x feature dim)^2 entries, which for realistic sizes runs to hundreds of GB. You always work with the Jacobian implicitly, via a closed-form rule for the vector-Jacobian product (VJP).
ReLU example. $y = \max(0, x)$ elementwise. The Jacobian is diagonal with 1's where $x_i > 0$ and 0's elsewhere: sparse and never instantiated. Backward is just $\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} \cdot \mathbb{1}[x_i > 0]$.
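In NumPy the whole implicit-Jacobian backward for ReLU collapses to one masked multiply (a sketch; x and the upstream grad_y are made-up placeholders):

import numpy as np

x = np.random.randn(6)
y = np.maximum(0, x)            # forward ReLU
grad_y = np.random.randn(6)     # pretend upstream gradient dL/dy
grad_x = grad_y * (x > 0)       # multiply by the diagonal Jacobian without ever building it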
Matrix multiply $Y = XW$ with $X \in \mathbb{R}^{N \times D}$, $W \in \mathbb{R}^{D \times M}$, $Y \in \mathbb{R}^{N \times M}$:
$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}\, W^{\top}, \qquad \frac{\partial L}{\partial W} = X^{\top}\, \frac{\partial L}{\partial Y}.$$
These formulas are easy to remember because they're the only way to make the shapes match up. (Karpathy: "if you can't remember the formula, just multiply things and transpose until shapes line up.")
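A quick NumPy check of those two formulas against a finite-difference estimate (the sizes, the checked entry, and the surrogate loss sum(Y * dY) are arbitrary choices; the two printed numbers should agree to several decimal places):

import numpy as np

N, D, M = 3, 4, 5
X = np.random.randn(N, D)
W = np.random.randn(D, M)

Y  = X @ W
dY = np.random.randn(N, M)      # pretend upstream gradient dL/dY
dX = dY @ W.T                   # analytic dL/dX
dW = X.T @ dY                   # analytic dL/dW

# finite-difference check of a single entry of dW, using L = sum(Y * dY) so dL/dY is exactly dY
eps = 1e-6
i, j = 1, 2
Wp = W.copy(); Wp[i, j] += eps
num = (np.sum((X @ Wp) * dY) - np.sum(Y * dY)) / eps
print(dW[i, j], num)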
Each gate type has a recognizable backward signature (add = distributor, mul = swap multiplier, copy = adder, max = router). See Computational Graph for the full table.
Why those signatures
Add distributes because $\frac{\partial (a+b)}{\partial a} = 1$, so the upstream gradient passes straight through. Mul swaps because $\frac{\partial (ab)}{\partial a} = b$, so each input gets the other input times the upstream. Max is a router because only the argmax input actually influenced the output; the rest contributed nothing, gradient zero. Every backward rule is just "what would a tiny wiggle of this input do to the output?"
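That "tiny wiggle" view can be checked numerically; this toy snippet (made-up values for a and b) nudges each input of a*b and max(a, b) and recovers exactly the swap-multiplier and router behaviour:

a, b, eps = 2.0, -3.0, 1e-6

# mul: wiggling a changes a*b at a rate of b, and vice versa
print(((a + eps) * b - a * b) / eps)         # ~ -3.0  (= b)
print((a * (b + eps) - a * b) / eps)         # ~  2.0  (= a)

# max: only the larger input gets gradient; the other is a dead end
print((max(a + eps, b) - max(a, b)) / eps)   # ~ 1.0  (a is the argmax, gradient routed to it)
print((max(a, b + eps) - max(a, b)) / eps)   # ~ 0.0  (b did not influence the output)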