Cache Blocking / Tiling
Tiling makes use of CUDA shared memory so that you reduce the number of times you read from global memory when executing a CUDA kernel.
Introduced to me through this article https://siboehm.com/articles/22/Fast-MMM-on-CPU
This is the MOST FUNDAMENTAL thing, introduced in chapter 4.4 of PMPP.
With tiled matrix multiplication, threads collaboratively load subsets of the M and N elements into shared memory before they individually use these elements in their dot product calculations.
I don’t get the names though:
- d_M = matrix M on device memory (GPU), so I assume h_M means matrix M on host memory (CPU)
- Mds = Md shared, i.e. M on device, held in shared memory
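A quick host-side sketch of that naming convention, with h_M living on the CPU and d_M on the GPU (numElements and the lack of error checking are my simplifications, not from the book):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    int numElements = 1024;  // placeholder size, my assumption

    // h_M is the host (CPU) copy; d_M is the device (GPU) copy.
    float *h_M = (float *)malloc(numElements * sizeof(float));
    float *d_M = NULL;
    cudaMalloc(&d_M, numElements * sizeof(float));

    // Copy host data into device memory before launching a kernel.
    cudaMemcpy(d_M, h_M, numElements * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_M);
    free(h_M);
    return 0;
}
```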
Reading this code
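The listing itself didn’t survive into these notes, so here is a minimal sketch of the PMPP-style tiled kernel the commentary below refers to, assuming square Width x Width matrices where Width is an exact multiple of TILE_WIDTH (TILE_WIDTH = 16 is my choice):

```cuda
#define TILE_WIDTH 16

__global__ void MatrixMulKernel(float *d_M, float *d_N, float *d_P, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0.0f;
    // One iteration per tile "phase" along the shared dimension.
    for (int ph = 0; ph < Width / TILE_WIDTH; ++ph) {
        // Cooperative load: each thread fetches one M and one N element.
        Mds[threadIdx.y][threadIdx.x] = d_M[row * Width + ph * TILE_WIDTH + threadIdx.x];
        Nds[threadIdx.y][threadIdx.x] = d_N[(ph * TILE_WIDTH + threadIdx.y) * Width + col];
        __syncthreads();  // 1st barrier: wait until both tiles are fully loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
        __syncthreads();  // 2nd barrier: wait until the tile is fully consumed
    }
    d_P[row * Width + col] = Pvalue;
}

// Launch with one thread per output element:
// dim3 block(TILE_WIDTH, TILE_WIDTH);
// dim3 grid(Width / TILE_WIDTH, Width / TILE_WIDTH);
// MatrixMulKernel<<<grid, block>>>(d_M, d_N, d_P, Width);
```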
You can see that the shared memory is scoped to a block: all threads in the block cooperate to load the tiles, and each thread can then read the elements loaded by the others.
Notice that there are 2 __syncthreads() calls. Each of them serves a different purpose.
First call (line 11 of the book’s listing): ensures that all threads have finished loading the tiles of M and N into Mds and Nds before any of them can move forward.
Second call: Ensures that threads have finished using the M and N elements in the shared memory before any of them move on to the next iteration and load the elements from the next tiles. In this manner, none of the threads would load the elements too early and corrupt the input values for other threads.
One drawback: because TILE_WIDTH is a compile-time constant that sizes the Mds and Nds arrays, the kernel cannot easily adjust its shared memory usage at runtime without recompilation.
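One workaround (my sketch, not the book’s code) is dynamically allocated shared memory via extern __shared__, which makes the tile width a launch-time parameter; tileWidth here is a hypothetical runtime argument that must match blockDim.x and blockDim.y:

```cuda
__global__ void MatrixMulDyn(const float *d_M, const float *d_N, float *d_P,
                             int Width, int tileWidth) {
    extern __shared__ float tiles[];             // sized by the launch config
    float *Mds = tiles;                          // first tileWidth^2 floats
    float *Nds = tiles + tileWidth * tileWidth;  // next tileWidth^2 floats

    int row = blockIdx.y * tileWidth + threadIdx.y;
    int col = blockIdx.x * tileWidth + threadIdx.x;

    float Pvalue = 0.0f;
    for (int ph = 0; ph < Width / tileWidth; ++ph) {
        // Same cooperative loads as before, but with manual 2D indexing.
        Mds[threadIdx.y * tileWidth + threadIdx.x] =
            d_M[row * Width + ph * tileWidth + threadIdx.x];
        Nds[threadIdx.y * tileWidth + threadIdx.x] =
            d_N[(ph * tileWidth + threadIdx.y) * Width + col];
        __syncthreads();

        for (int k = 0; k < tileWidth; ++k)
            Pvalue += Mds[threadIdx.y * tileWidth + k]
                    * Nds[k * tileWidth + threadIdx.x];
        __syncthreads();
    }
    d_P[row * Width + col] = Pvalue;
}

// Launch: the third <<<>>> argument is the shared-memory size in bytes.
// size_t shmem = 2 * (size_t)tileWidth * tileWidth * sizeof(float);
// MatrixMulDyn<<<grid, block, shmem>>>(d_M, d_N, d_P, Width, tileWidth);
```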
Boundary Checks
We add boundary checks to the tiled kernel for matrices whose dimensions are not an exact multiple of the tile width, so that threads in partial tiles don’t read or write out of bounds.
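A sketch of how those checks typically look, reusing the names from the kernel above; out-of-range loads fill the tile with 0.0f (which contributes nothing to the dot product), and the final store is guarded too:

```cuda
#define TILE_WIDTH 16  // same constant as in the earlier sketch

__global__ void MatrixMulBounded(const float *d_M, const float *d_N,
                                 float *d_P, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0.0f;
    // ceil(Width / TILE_WIDTH) phases so partial tiles are covered too.
    for (int ph = 0; ph < (Width + TILE_WIDTH - 1) / TILE_WIDTH; ++ph) {
        int mCol = ph * TILE_WIDTH + threadIdx.x;
        int nRow = ph * TILE_WIDTH + threadIdx.y;

        // Guarded loads: pad out-of-range positions with zeros.
        Mds[threadIdx.y][threadIdx.x] =
            (row < Width && mCol < Width) ? d_M[row * Width + mCol] : 0.0f;
        Nds[threadIdx.y][threadIdx.x] =
            (nRow < Width && col < Width) ? d_N[nRow * Width + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
        __syncthreads();
    }

    // Guarded store: threads outside the matrix write nothing.
    if (row < Width && col < Width)
        d_P[row * Width + col] = Pvalue;
}
```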