Tensor Memory Accelerator (TMA)

Learned about it through this blog https://fleetwood.dev/posts/domain-specific-architectures#google-tpu.

Is this just a fancy GPU DMA?

Kinda, but there’s more to it. GPUs still have conventional DMA copy engines; TMA, introduced with NVIDIA’s Hopper architecture, is an additional, more specialized data mover.

What’s “DMA-like” about it

  • It’s hardware moving data for you asynchronously, so your compute threads don’t spend tons of instructions doing loads/stores.
  • You can overlap “move this tile” with “compute on the previous tile,” which feels like a DMA engine.

What makes TMA different from a generic DMA

  • It’s SM-local and tile/ndim aware. You describe a multi-dimensional tile (shape + strides) and TMA moves it directly into shared memory in the layout you want. A classic DMA is usually “copy this linear range.”

  • It integrates with the SM scheduling/pipelines. The copy is meant to feed tensor-core style tiled kernels (GEMM/attention) with minimal instruction overhead and tight synchronization semantics (barriers/arrive-wait patterns).

  • It’s optimized for shared memory tiling, not just device↔device memcpy. Think “specialized data-mover for kernels,” not a general-purpose copy engine like the GPU’s copy engines for cudaMemcpyAsync.