Tensor Processing Unit (TPU)
Specialized hardware for training (and serving) neural networks, developed and sold by Google.
Resources
https://cloud.google.com/tpu/docs/intro-to-tpu
TPUs work more like a conveyor belt than a GPU's load/compute/store loop.
What a GPU is doing (memory-hierarchy dataflow)
Even for a matmul, a GPU typically does something like:
- Load tiles of A/B from HBM → L2 → shared memory (or caches)
- Each warp loads fragments into registers
- Tensor Core does a small MMA (e.g., 16×16×16)
- Accumulate partial sums in registers
- Eventually write results back to shared/HBM
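The steps above can be sketched in plain NumPy. This is a rough analogy, not GPU code: the tile slices stand in for shared-memory staging, the small per-tile matmul stands in for a Tensor Core MMA, and `acc` plays the role of accumulator registers. The function name and the assumption that shapes divide evenly by `tile` are my own.

```python
import numpy as np

def tiled_matmul(A, B, tile=16):
    """Sketch of the GPU-style tiled matmul pipeline (assumes shapes
    are exact multiples of `tile`)."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            acc = np.zeros((tile, tile))        # partial sums "in registers"
            for p in range(0, k, tile):
                a_frag = A[i:i+tile, p:p+tile]  # "load tile of A into shared/registers"
                b_frag = B[p:p+tile, j:j+tile]  # "load tile of B"
                acc += a_frag @ b_frag          # small MMA, e.g. 16x16x16
            C[i:i+tile, j:j+tile] = acc         # "write results back to shared/HBM"
    return C
```

Note the inner loop keeps re-fetching fragments from the (simulated) memory hierarchy each iteration; that is exactly the traffic a systolic array avoids.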
What a systolic array is doing (spatial dataflow)
A systolic array is more like a conveyor belt for matrix multiply:
- You “inject” A values from one edge and B values from another edge.
- Each processing element (PE) does:
  acc += a * b
  then forwards a to one neighbor and b to its other neighbor.
- The key: a and b are reused by physically moving through neighbors, not by being repeatedly fetched from a big register file.
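A cycle-by-cycle simulation makes the conveyor-belt picture concrete. This is a minimal sketch of an output-stationary systolic array (one common variant): each PE keeps its accumulator in place, `a` values stream rightward from the left edge, `b` values stream downward from the top edge, and the injections are skewed by one cycle per row/column so matching operands meet at the right PE. All names here are illustrative, not from any real TPU API.

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing A @ B."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m))                       # one accumulator per PE
    a_reg = [[0.0] * m for _ in range(n)]        # a value held by PE (i, j)
    b_reg = [[0.0] * m for _ in range(n)]        # b value held by PE (i, j)
    for t in range(n + m + k - 2):               # cycles to fill and drain
        # Forward: a moves one PE to the right, b moves one PE down
        # (iterate in reverse so values aren't overwritten before moving).
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed edge inputs: row i sees A[i, t-i], column j sees B[t-j, j].
        for i in range(n):
            a_reg[i][0] = A[i, t - i] if 0 <= t - i < k else 0.0
        for j in range(m):
            b_reg[0][j] = B[t - j, j] if 0 <= t - j < k else 0.0
        # Every PE does acc += a * b with whatever just arrived.
        for i in range(n):
            for j in range(m):
                acc[i, j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

Notice that each A and B element is read from "memory" exactly once (at injection) and then travels PE-to-PE, which is the reuse-by-movement point above.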