Tensor Memory Accelerator (TMA)
Learned about it through this blog https://fleetwood.dev/posts/domain-specific-architectures#google-tpu.
Is this just a fancy GPU DMA?
Kinda, but there's more to it. GPUs still have plain DMA engines too; TMA is something extra on top.
What's "DMA-like" about it?
- It's hardware moving data for you asynchronously, so your compute threads don't burn tons of instructions on loads/stores.
- You can overlap “move this tile” with “compute on the previous tile,” which feels like a DMA engine.
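That overlap is just the classic double-buffering pattern. A minimal sketch of it using `cuda::memcpy_async` and a block-scoped pipeline from libcu++ (CUDA 11+) is below; on Hopper this is the pattern TMA hardware-accelerates. `TILE`, `process_tile`, and the two-buffer layout are illustrative choices, not a fixed API.

```cuda
// Sketch: overlap "move the next tile" with "compute on the current tile".
// Assumes CUDA 11+ with libcu++; TILE and process_tile are made up for illustration.
#include <cuda/pipeline>
#include <cooperative_groups.h>

constexpr int TILE = 256;

__device__ float sink;  // placeholder so the "compute" step does something
__device__ void process_tile(const float* tile) { sink += tile[0]; }

__global__ void pipelined_kernel(const float* __restrict__ global_in, int num_tiles) {
    __shared__ float buf[2][TILE];  // two buffers: copy into one, compute on the other
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> pss;
    auto block = cooperative_groups::this_thread_block();
    auto pipe  = cuda::make_pipeline(block, &pss);

    // Prime the pipeline with the first tile.
    pipe.producer_acquire();
    cuda::memcpy_async(block, buf[0], global_in, sizeof(float) * TILE, pipe);
    pipe.producer_commit();

    for (int t = 0; t < num_tiles; ++t) {
        int cur = t % 2, nxt = (t + 1) % 2;
        if (t + 1 < num_tiles) {  // kick off the next copy...
            pipe.producer_acquire();
            cuda::memcpy_async(block, buf[nxt],
                               global_in + (t + 1) * TILE,
                               sizeof(float) * TILE, pipe);
            pipe.producer_commit();
        }
        pipe.consumer_wait();      // ...and wait only for the current one
        block.sync();
        process_tile(buf[cur]);    // compute overlaps the in-flight copy
        block.sync();
        pipe.consumer_release();
    }
}
```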
What makes TMA different from a generic DMA?
- It's SM-local and tile/ndim-aware. You describe a multi-dimensional tile (shape + strides) and TMA moves it directly into shared memory in the layout you want. A classic DMA is usually "copy this linear range."
- It integrates with the SM scheduling/pipelines. The copy is meant to feed tensor-core-style tiled kernels (GEMM/attention) with minimal instruction overhead and tight synchronization semantics (barriers/arrive-wait patterns).
- It's optimized for shared-memory tiling, not just device↔device memcpy. Think "specialized data-mover for kernels," not a general-purpose copy engine like the GPU's copy engines behind cudaMemcpyAsync.
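The "tile/ndim-aware" part is concrete: on Hopper (sm_90+) you build a TMA descriptor on the host with the CUDA 12 driver API, encoding the tensor's shape, strides, and the tile ("box") to move per copy. A hedged sketch, assuming a 1024x1024 row-major float matrix and a 64x64 tile (the sizes and the name `d_matrix` are made up for illustration; error checking omitted):

```cuda
// Sketch: encode a TMA descriptor (CUtensorMap) describing "move 64x64 float
// tiles out of a 1024x1024 row-major matrix". The kernel would then receive
// the map (e.g. as a const __grid_constant__ parameter) and issue bulk tensor
// copies against it.
#include <cuda.h>

CUtensorMap make_tile_map(void* d_matrix) {
    CUtensorMap tmap;
    cuuint64_t global_dim[2]    = {1024, 1024};            // full tensor shape
    cuuint64_t global_stride[1] = {1024 * sizeof(float)};  // outer-dim stride in bytes
                                                           // (innermost stride is implied)
    cuuint32_t box_dim[2]       = {64, 64};                // tile moved per TMA op
    cuuint32_t elem_stride[2]   = {1, 1};                  // dense elements

    cuTensorMapEncodeTiled(
        &tmap,
        CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
        /*tensorRank=*/2,
        d_matrix,
        global_dim,
        global_stride,
        box_dim,
        elem_stride,
        CU_TENSOR_MAP_INTERLEAVE_NONE,
        CU_TENSOR_MAP_SWIZZLE_NONE,   // real kernels often swizzle to dodge bank conflicts
        CU_TENSOR_MAP_L2_PROMOTION_NONE,
        CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return tmap;
}
```

A generic DMA engine has no equivalent of this: the shape/stride/tile logic lives in the descriptor and the TMA hardware, not in per-thread address arithmetic.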