Tensor Processing Unit (TPU)

A TPU is an application-specific integrated circuit (ASIC) designed and sold by Google to accelerate neural-network training and inference.

https://cloud.google.com/tpu/docs/intro-to-tpu

Systolic Array

TPUs work more like a conveyor belt: operands stream through a fixed grid of multiply-accumulate units (the systolic array) rather than being shuttled through an explicit memory hierarchy.

Even for matmul, the GPU typically does something like:

  • Load tiles of A/B from HBM → L2 → shared memory (or caches)
  • Each warp loads fragments into registers
  • Tensor Core does a small MMA (e.g., 16×16×16)
  • Accumulate partial sums in registers
  • Eventually write results back to shared/HBM
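The steps above can be sketched in NumPy. This is a toy model, not GPU code: the `acc` array stands in for the register accumulators, the tile slices stand in for shared-memory loads, and the small `a @ b` stands in for a Tensor Core MMA. The `tile` size and function name are illustrative choices.

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    # Toy model of the GPU flow: each output tile is built by
    # accumulating tile-sized partial products over the K dimension.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):              # output tile rows
        for j in range(0, N, tile):          # output tile cols
            acc = np.zeros((tile, tile), dtype=A.dtype)  # "registers"
            for k in range(0, K, tile):      # walk the K dimension
                a = A[i:i+tile, k:k+tile]    # "load tile of A into shared"
                b = B[k:k+tile, j:j+tile]    # "load tile of B into shared"
                acc += a @ b                 # small MMA, partial sums
            C[i:i+tile, j:j+tile] = acc      # "write results back"
    return C
```

Note the reuse pattern: every element of each A/B tile is fetched once per tile pair but used `tile` times inside the small MMA — the same reuse a systolic array gets by physically forwarding values, as described next.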

What a systolic array is doing (spatial dataflow)

A systolic array is more like a conveyor belt for matrix multiply:

  • You “inject” A values from one edge and B values from another edge.
  • Each processing element (PE) does:
    acc += a * b
    then forwards a to its right neighbor and b to its downward neighbor.
  • The key: a and b are reused by physically moving through neighbors, not by being repeatedly fetched from a big register file.
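The conveyor-belt behavior can be simulated cycle by cycle. A minimal sketch, assuming the common output-stationary layout: PE (i, j) keeps its own accumulator, A values enter from the left edge (row i delayed by i cycles), B values enter from the top edge (column j delayed by j cycles), and each cycle every value steps one PE right or down. The function name and the skewing details are my assumptions for the illustration.

```python
import numpy as np

def systolic_matmul(A, B):
    # Cycle-by-cycle simulation of an output-stationary systolic array.
    # acc[i, j] is the accumulator held inside PE (i, j).
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    acc = np.zeros((M, N))
    a_reg = np.zeros((M, N))   # the A value currently inside each PE
    b_reg = np.zeros((M, N))   # the B value currently inside each PE
    for t in range(K + M + N - 2):       # enough cycles to drain
        # Conveyor-belt step: every a moves right, every b moves down.
        a_reg = np.roll(a_reg, 1, axis=1)
        b_reg = np.roll(b_reg, 1, axis=0)
        # Inject skewed inputs at the edges (overwrites the wrapped-
        # around column/row from np.roll): row i of A enters the left
        # edge at cycles t = i, i+1, ...; column j of B enters the top.
        for i in range(M):
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < K else 0.0
        for j in range(N):
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < K else 0.0
        # Every PE multiplies the pair that just arrived and accumulates.
        acc += a_reg * b_reg
    return acc
```

The skew is what makes it work: at cycle t, PE (i, j) sees A[i, t-i-j] and B[t-i-j, j] — the same index k for both — so its accumulator ends up with the correct dot product, and each operand was fetched from memory exactly once.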