Memory-Bound Program

Notes copied from Chapter 4 of PMPP.

Memory-Bound Program

Memory-bound programs are programs whose execution speed is limited by memory access throughput.

One of the fundamental concept is the compute-to-global-memory-access ratio.

Memory is expensive

Compute is a lot faster than memory access, so you want this number to be as high as possible.

They motivate this through matrix multiplication.

__global__ void MatrixMulKernel(float* M, float* N, float* P,
int Width) {
  // Calculate the row index of the P element and M
  int Row = blockIdx.y*blockDim.y+threadIdx.y;
  // Calculate the column index of P and N
  int Col = blockIdx.x*blockDim.x+threadIdx.x;
  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k) {
      Pvalue += M[Row*Width+k]*N[k*Width+Col];
    }
    P[Row*Width+Col] = Pvalue;
  }
}

The above is slow

Every single time you are reading into M[...], N[...], or P[...], that is a read from global memory.

EXERCISE: Calculate the compute-to-memory-acess ratio. It is 1.0

Fundamental question to answer

Is memory access done per thread, per block, or per warp?

  • It seems to be done per warp basis. More of this is covered in Chapter 5