Backpropagation
Backpropagation is at the core of any modern Deep Learning model.
Backpropagation is a way of computing gradients of expressions through recursive application of the chain rule. It is usually used to compute the gradient of the loss with respect to every variable (weight).
After you do a pass of backpropagation to compute all of the gradients, you update the weights through Gradient Descent.
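As a minimal sketch of that update step (the learning rate and the values here are illustrative, not from these notes), gradient descent just nudges each weight against its gradient:

```python
# Plain gradient descent step after a backward pass (illustrative values).
learning_rate = 0.01
weights = [0.5, -1.2, 0.3]
grads = [0.1, -0.4, 0.05]   # gradients of the loss w.r.t. each weight

# Move each weight a small step against its gradient to reduce the loss.
weights = [w - learning_rate * g for w, g in zip(weights, grads)]
print(weights)              # roughly [0.499, -1.196, 0.2995]
```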
In real life these scalar values are packed into tensors, which allows us to take advantage of vectorized operations on hardware like GPUs.
From calculus, I learned that the gradient is the vector of partial derivatives. In ML, when we say "the gradient on x", we usually mean "the partial derivative of the loss with respect to x", i.e. the gradient value for that single term.
The gradient flows backwards by multiplication through the chain rule. Here is a simple example:
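A minimal sketch of such a graph, assuming the setup e = a + b and L = e * c (the values and variable names are illustrative), with every gradient computed by hand:

```python
a, b, c = 2.0, -3.0, 4.0

# Forward pass through the tiny graph.
e = a + b        # e = -1.0
L = e * c        # L = -4.0

# Backward pass: start at the output and apply the chain rule.
dL_dL = 1.0              # the gradient of L with respect to itself
dL_de = c * dL_dL        # local gradient of the * gate w.r.t. e is c
dL_dc = e * dL_dL        # local gradient of the * gate w.r.t. c is e
dL_da = 1.0 * dL_de      # + gate: local gradient is 1, gradient just flows
dL_db = 1.0 * dL_de

print(dL_da, dL_db, dL_dc)   # 4.0 4.0 -1.0
```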
Above, the gradient on e (the derivative of L with respect to e) was calculated by doing dL/de = (local gradient) * (upstream gradient) = c * dL/dL, where the local gradient is c (since L = e * c) and dL/dL = 1 is the gradient calculated in the upper layer.
Specifically, you see that addition is simply a distributor of the gradient: its local derivative is 1, so the upstream gradient just flows through unchanged to each of its inputs.
We propagate the gradients backwards so that we can update our weights through Gradient Descent.
Backpropagation is computed starting from the back, because (by the chain rule) the gradients of the layers at the front depend on the gradients of the layers behind them.
Lessons: update the gradient using +=, instead of overwriting with =, which would lead to the wrong value being calculated. This allows gradients to be accumulated, so that, for example, when a variable is used in more than one place in the graph, each use contributes its own term to the total gradient.
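A minimal sketch of why += matters, assuming a variable that is used twice (y = x * x; the names are illustrative): overwriting would drop one branch's contribution.

```python
x = 3.0
y = x * x          # x feeds the multiply gate twice

# Each use of x contributes a term to dy/dx; with = instead of +=,
# only the last term would survive and we would get 3.0 instead of 6.0.
dy_dx = 0.0
dy_dx += x * 1.0   # gradient through the first operand
dy_dx += x * 1.0   # gradient through the second operand
print(dy_dx)       # 6.0, which matches d(x^2)/dx = 2x
```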
Computational Graphs
These are actually amazing because you get to decide the level of abstraction at which you define the operations. So you can implement tanh as a single operation, or really break it down into primitive ops. The key thing is that you need to tell the graph how each operation's local gradient is calculated.
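For instance, here is a minimal sketch of tanh treated as a single gate (the TanhGate class and its method names are made up for illustration); all the graph needs from it are a forward value and a rule for its local gradient:

```python
import math

class TanhGate:
    def forward(self, x):
        # Cache the output, because the local gradient depends on it.
        self.out = math.tanh(x)
        return self.out

    def backward(self, upstream_grad):
        # Local gradient of tanh(x) is 1 - tanh(x)^2; multiply it by the
        # gradient flowing in from the layer above (chain rule).
        return (1.0 - self.out ** 2) * upstream_grad

gate = TanhGate()
y = gate.forward(0.5)
dx = gate.backward(1.0)   # gradient of the final output w.r.t. x
```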
Example (this is vectorized, but the idea is the same)
Imagine you have input data x and two layers W1 and W2 (W1 further to the left, closer to the input). The equations could be h = W1 * x and s = W2 * h, with the loss L computed from s. If you want to calculate the gradient of the loss with respect to the first layer, that depends on the second layer, so you need the gradient of the loss with respect to the second layer's input first.
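A minimal vectorized sketch of this, assuming h = W1 @ x, s = W2 @ h, and a simple loss L = 0.5 * ||s||^2 (the shapes, names, and loss are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((3, 5))

# Forward pass.
h = W1 @ x
s = W2 @ h
L = 0.5 * np.sum(s ** 2)

# Backward pass: dL/dW1 needs dL/dh, which in turn needs dL/ds first.
dL_ds = s                       # gradient of 0.5 * ||s||^2 w.r.t. s
dL_dW2 = np.outer(dL_ds, h)     # local gradient of s = W2 @ h w.r.t. W2
dL_dh = W2.T @ dL_ds            # propagate through the second layer
dL_dW1 = np.outer(dL_dh, x)     # then through the first layer
```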
Backpropagation is beautiful because it is a very local process: each node only cares about its local gradients and the cached gradient from the upper layer; it multiplies the two and keeps propagating the result backwards.
Backpropagation can thus be thought of as gates communicating to each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value higher.
https://cs231n.github.io/optimization-2/
To compute gradients in the vectorized case, the local gradient generalizes to the Jacobian matrix: the matrix of partial derivatives of each output element with respect to each input element.
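As a minimal sketch, for an elementwise operation like ReLU the Jacobian is diagonal, so in practice you never form the full matrix and instead apply the equivalent elementwise mask (the values here are illustrative):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
upstream = np.array([0.1, 0.2, 0.3])   # gradient arriving from the layer above

# Full Jacobian of ReLU: diagonal matrix with 1 where x > 0, else 0.
jacobian = np.diag((x > 0).astype(float))
grad_full = jacobian @ upstream

# Equivalent result, without ever building the matrix.
grad_masked = upstream * (x > 0)
assert np.allclose(grad_full, grad_masked)   # both give [0.1, 0.0, 0.3]
```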
Misc
Andrej Karpathy has a really good article on why understanding this is important.