Tensor Memory Accelerator (TMA)

Learned about it through this blog https://fleetwood.dev/posts/domain-specific-architectures#google-tpu.

Is this just a fancy GPU DMA?

Kinda, but there’s more to it. GPUs still have conventional DMA copy engines; TMA, introduced with NVIDIA’s Hopper architecture, is an additional, more specialized data mover.

What’s “DMA-like” about it

  • It’s hardware moving data for you asynchronously, so your compute threads don’t spend tons of instructions doing loads/stores.
  • You can overlap “move this tile” with “compute on the previous tile,” which feels like a DMA engine.

What makes TMA different from a generic DMA

  • It’s SM-local and tile/ndim aware. You describe a multi-dimensional tile (shape + strides) and TMA moves it directly into shared memory in the layout you want. A classic DMA is usually “copy this linear range.”

  • It integrates with the SM scheduling/pipelines. The copy is meant to feed tensor-core style tiled kernels (GEMM/attention) with minimal instruction overhead and tight synchronization semantics (barriers/arrive-wait patterns).

  • It’s optimized for shared memory tiling, not just device↔device memcpy. Think “specialized data-mover for kernels,” not a general-purpose copy engine like the GPU’s copy engines for cudaMemcpyAsync.