Pinned Memory
Pinned (page-locked) memory is host RAM the OS promises not to swap, keeping those pages in physical memory so the GPU’s DMA engine can read directly from them.
Why does regular memory need this?
Normal host memory is pageable: the OS can move it to disk at any time, so the GPU's DMA engine cannot safely read from it. A `cudaMemcpy` from pageable RAM therefore first copies into a pinned staging buffer, then DMAs to the GPU. That first copy is pure overhead; pinned memory skips it.
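The cost of that staging copy is easy to measure from PyTorch. A minimal sketch, assuming a CUDA device is present (`h2d_bandwidth` is a name invented here, not a library function):

```python
import time
import torch

def h2d_bandwidth(pinned: bool, n_mb: int = 256) -> float:
    """Time one host-to-device copy and return GiB/s.
    Hypothetical helper; requires a CUDA device."""
    x = torch.empty(n_mb * 1024 * 1024, dtype=torch.uint8)
    if pinned:
        x = x.pin_memory()          # page-locked copy (cudaHostAlloc under the hood)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    x.to('cuda', non_blocking=pinned)
    torch.cuda.synchronize()        # make the timing include the whole copy
    return (n_mb / 1024) / (time.perf_counter() - t0)

if torch.cuda.is_available():
    print(f"pageable: {h2d_bandwidth(False):.1f} GiB/s")
    print(f"pinned:   {h2d_bandwidth(True):.1f} GiB/s")
```

On PCIe-attached GPUs the pinned number is typically noticeably higher, since the pageable path pays for the extra host-side copy.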
Payoff
- Higher effective bandwidth: saturates PCIe instead of being bottlenecked by the staging copy
- Enables async H2D: `cudaMemcpyAsync` requires a pinned source to actually be asynchronous; without one, the call silently blocks
- Unlocks overlapping copies with compute on a CUDA stream, which is how data-loading latency gets hidden behind the training step
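The silent-blocking behavior can be seen from PyTorch, where `non_blocking=True` maps onto `cudaMemcpyAsync`. A sketch, assuming a CUDA device (buffer sizes are arbitrary):

```python
import torch

# With a pinned source, non_blocking=True returns while the DMA is still
# in flight; with a pageable source the flag is quietly ignored and the
# host blocks until the copy finishes.
if torch.cuda.is_available():
    pinned = torch.empty(64 * 1024 * 1024, dtype=torch.uint8).pin_memory()
    pageable = torch.empty(64 * 1024 * 1024, dtype=torch.uint8)

    pinned.to('cuda', non_blocking=True)        # returns immediately
    print(torch.cuda.current_stream().query())  # often False: copy still running
    torch.cuda.synchronize()                    # wait before reusing the host buffer

    pageable.to('cuda', non_blocking=True)      # flag ignored: host blocks here
```

Note the `synchronize()` before reusing the pinned buffer: an async copy that is still in flight would otherwise read whatever you write next.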
PyTorch usage
```python
loader = DataLoader(dataset, pin_memory=True, num_workers=4)
for batch in loader:
    batch = batch.to('cuda', non_blocking=True)  # async only works with pinned source
    ...
```

Writing it manually:

```python
x = torch.randn(1000, 1000).pin_memory()
y = x.to('cuda', non_blocking=True)
```

See PyTorch Performance Tuning for the full recipe.
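To actually hide loading latency behind the training step, copy batch i+1 on a side stream while batch i computes. A hypothetical sketch (`prefetch_loop` is invented here; it assumes the loader yields pinned CPU tensors, e.g. from `DataLoader(pin_memory=True)`, and that a CUDA device exists):

```python
import torch

def prefetch_loop(loader, step):
    """Overlap H2D copies with compute: stage the next batch on a side
    stream while the current batch runs on the default stream."""
    copy_stream = torch.cuda.Stream()

    def stage(cpu_batch):
        with torch.cuda.stream(copy_stream):
            return cpu_batch.to('cuda', non_blocking=True)

    it = iter(loader)
    cur = stage(next(it))
    for cpu_batch in it:
        torch.cuda.current_stream().wait_stream(copy_stream)  # cur is ready
        nxt = stage(cpu_batch)  # next H2D copy runs on the side stream...
        step(cur)               # ...while this step computes on the default stream
        cur = nxt
    torch.cuda.current_stream().wait_stream(copy_stream)
    step(cur)
```

The `wait_stream` calls order the default stream after the copy that produced `cur`, while leaving the freshly issued copy free to overlap with `step(cur)`.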
Trade-off
Pinned pages can’t be swapped, so over-allocating them starves the OS of pageable memory and can destabilize the system. Pin data-loader buffers, not everything.