Memory-Bound Program
Notes copied from Chapter 4 of PMPP.
Memory-bound programs are programs whose execution speed is limited by memory access throughput.
One of the fundamental concepts is the compute-to-global-memory-access ratio: the number of floating-point operations performed per access to global memory.
Memory is expensive
Compute is a lot faster than memory access, so you want this ratio to be as high as possible.
The book motivates this with matrix multiplication:
__global__ void MatrixMulKernel(float* M, float* N, float* P,
                                int Width) {
    // Calculate the row index of the P element and M
    int Row = blockIdx.y*blockDim.y+threadIdx.y;
    // Calculate the column index of P and N
    int Col = blockIdx.x*blockDim.x+threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k]*N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}
The above is slow
Every access to M[...], N[...], or P[...] goes to global memory: the loop reads one element of M and one element of N on every iteration, and each thread writes its P element once at the end.
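EXERCISE: Calculate the compute-to-memory-access ratio. It is 1.0: each loop iteration performs two floating-point operations (one multiply and one add) for two global memory reads (one element of M and one of N).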
Fundamental question to answer
Is memory access done per thread, per block, or per warp?
- It seems to be done on a per-warp basis. More on this is covered in Chapter 5.