Tensor Processing Unit (TPU)
Specialized hardware for training (and serving) neural networks, developed and sold by Google.
Resources
https://cloud.google.com/tpu/docs/intro-to-tpu
TPUs work more like a conveyor belt than a GPU's load/compute/store loop.
What a GPU is doing (memory-hierarchy dataflow)
Even for a matmul, a GPU typically does something like:
- Load tiles of A/B from HBM → L2 → shared memory (or caches)
- Each warp loads fragments into registers
- Tensor Core does a small MMA (e.g., 16×16×16)
- Accumulate partial sums in registers
- Eventually write results back to shared/HBM
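The steps above can be sketched in plain NumPy. This is a rough analogy, not GPU code: the tile slices stand in for shared-memory staging, the small per-tile matmul stands in for a Tensor Core MMA, and `acc` plays the role of accumulator registers. The function name and the assumption that shapes divide evenly by `tile` are my own.

```python
import numpy as np

def tiled_matmul(A, B, tile=16):
    """Sketch of the GPU-style tiled matmul pipeline (assumes shapes
    are exact multiples of `tile`)."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            acc = np.zeros((tile, tile))        # partial sums "in registers"
            for p in range(0, k, tile):
                a_frag = A[i:i+tile, p:p+tile]  # "load tile of A into shared/registers"
                b_frag = B[p:p+tile, j:j+tile]  # "load tile of B"
                acc += a_frag @ b_frag          # small MMA, e.g. 16x16x16
            C[i:i+tile, j:j+tile] = acc         # "write results back to shared/HBM"
    return C
```

Note the inner loop keeps re-fetching fragments from the (simulated) memory hierarchy each iteration; that is exactly the traffic a systolic array avoids.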
What a systolic array is doing (spatial dataflow)
A systolic array is more like a conveyor belt for matrix multiply:
- You “inject” A values from one edge and B values from another edge.
- Each processing element (PE) does:
  acc += a * b
  then forwards a to one neighbor and b to its other neighbor.
- The key: a and b are reused by physically moving through neighbors, not by being repeatedly fetched from a big register file.
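A cycle-by-cycle simulation makes the conveyor-belt picture concrete. This is a minimal sketch of an output-stationary systolic array (one common variant): each PE keeps its accumulator in place, `a` values stream rightward from the left edge, `b` values stream downward from the top edge, and the injections are skewed by one cycle per row/column so matching operands meet at the right PE. All names here are illustrative, not from any real TPU API.

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing A @ B."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m))                       # one accumulator per PE
    a_reg = [[0.0] * m for _ in range(n)]        # a value held by PE (i, j)
    b_reg = [[0.0] * m for _ in range(n)]        # b value held by PE (i, j)
    for t in range(n + m + k - 2):               # cycles to fill and drain
        # Forward: a moves one PE to the right, b moves one PE down
        # (iterate in reverse so values aren't overwritten before moving).
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed edge inputs: row i sees A[i, t-i], column j sees B[t-j, j].
        for i in range(n):
            a_reg[i][0] = A[i, t - i] if 0 <= t - i < k else 0.0
        for j in range(m):
            b_reg[0][j] = B[t - j, j] if 0 <= t - j < k else 0.0
        # Every PE does acc += a * b with whatever just arrived.
        for i in range(n):
            for j in range(m):
                acc[i, j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

Notice that each A and B element is read from "memory" exactly once (at injection) and then travels PE-to-PE, which is the reuse-by-movement point above.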