Stephen Jones

He gave some fundamental keynotes on CUDA.

The GPU is a throughput machine. The CPU is a latency machine.

Stephen, you love fighting bottlenecks.

  • Well, the bottleneck is not FLOPS, it’s memory bandwidth

FLOPS aren’t the issue - bandwidth is the issue.
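A rough way to see why bandwidth dominates is to ask how many arithmetic operations the chip could do in the time it takes to load one value from memory. The sketch below is only an illustration: the function name and both hardware numbers are hypothetical round figures, not measurements of any real GPU.

```python
def flops_per_loaded_value(peak_flops, bandwidth_bytes_per_s, bytes_per_value=8):
    """Operations the chip can perform in the time needed to load one value."""
    values_per_second = bandwidth_bytes_per_s / bytes_per_value
    return peak_flops / values_per_second


# Hypothetical round numbers chosen for easy arithmetic, NOT real hardware specs.
HYPOTHETICAL_PEAK_FLOPS = 10e12   # 10 TFLOP/s
HYPOTHETICAL_BANDWIDTH = 1e12     # 1 TB/s

ratio = flops_per_loaded_value(HYPOTHETICAL_PEAK_FLOPS, HYPOTHETICAL_BANDWIDTH)
print(ratio)  # operations needed per 8-byte value loaded to keep the math units busy
```

If a kernel does fewer operations per loaded value than this ratio, it is waiting on memory, not on FLOPS.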

Fundamental Difference

  • CPU designers work to minimize latency
  • GPU designers don’t care about latency as much; they increase bandwidth instead (15:00 of video)

He explains how DRAM works.

Sense Amplifier on the RAM

Efficient use of resources drives performance.

He talks about how each SM can manage 64 warps, for a total of 2048 threads. However, it really processes only 4 warps at a time.

The memory page size is exactly 1024 bytes.

  • An SM is actually running 4 warps at the same time; the rest are kept in a queue

For a single thread, this looks like random-address memory reads. It’s actually adjacent reads of whole pages of memory.
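The numbers in these notes fit together, which a quick sketch can check. It assumes a warp is 32 threads, that each thread reads one 8-byte value, and that the data starts on a page boundary; the last two are my assumptions, and `page_of` is a hypothetical helper.

```python
WARP_SIZE = 32          # threads per warp on NVIDIA GPUs
MAX_WARPS_PER_SM = 64   # from the notes
ACTIVE_WARPS = 4        # from the notes
PAGE_BYTES = 1024       # from the notes
BYTES_PER_THREAD = 8    # assumption: each thread reads one 8-byte value

# 64 warps of 32 threads gives the 2048-thread figure.
max_threads_per_sm = MAX_WARPS_PER_SM * WARP_SIZE

# Bytes requested by the 4 active warps in one pass.
bytes_per_pass = ACTIVE_WARPS * WARP_SIZE * BYTES_PER_THREAD


def page_of(thread_index, base_address=0):
    """Memory page touched by a thread reading consecutive 8-byte values."""
    return (base_address + thread_index * BYTES_PER_THREAD) // PAGE_BYTES


# Adjacent threads reading adjacent values all land in the same page.
pages_touched = {page_of(t) for t in range(ACTIVE_WARPS * WARP_SIZE)}

print(max_threads_per_sm, bytes_per_pass, pages_touched)
```

Under these assumptions, the 4 active warps together consume exactly one page.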

One SM per block!

A block runs on a single SM. It can never span 2 different SMs.

Blocks get placed

Is there a way to sanity-check occupancy?
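One rough sanity check is to do the warp arithmetic by hand. The sketch below uses only the 64-warps-per-SM limit from these notes and assumes 32-thread warps. It ignores registers, shared memory, and the cap on resident blocks per SM, so treat its result as an upper bound. The function name is hypothetical.

```python
WARP_SIZE = 32          # threads per warp on NVIDIA GPUs
MAX_WARPS_PER_SM = 64   # from the notes


def estimated_occupancy(threads_per_block):
    """Upper-bound occupancy from the warp limit alone (other limits ignored)."""
    if not 1 <= threads_per_block <= 1024:
        raise ValueError("threads_per_block must be between 1 and 1024")
    # A partially filled warp still occupies a whole warp slot.
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    # A block lives entirely on one SM, so only whole blocks fit.
    resident_blocks = MAX_WARPS_PER_SM // warps_per_block
    resident_warps = resident_blocks * warps_per_block
    return resident_warps / MAX_WARPS_PER_SM


for size in (256, 768, 1024):
    print(size, estimated_occupancy(size))
```

For a real kernel, the CUDA runtime function `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports how many blocks can be resident per SM with the other resource limits taken into account.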


He talks a little more about CUDA streams.

  • This is how you pack together different streams of blocks