Memory-Bound Program

Notes copied from Chapter 4 of PMPP.

Memory-bound programs are programs whose execution speed is limited by memory access throughput.

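One of the fundamental concepts is the compute-to-global-memory-access ratio: the number of floating-point operations performed per access to global memory within a region of a program.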

Memory is expensive

Compute is much faster than global memory access, so you want the compute-to-global-memory-access ratio to be as high as possible.
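As a rough sketch of why a higher ratio helps, the bound below relates the ratio to the throughput a kernel can reach. The symbols r, s, and B and the numbers plugged in are illustrative assumptions, not values from these notes, and the bound only matters while it stays below the device's peak compute rate.

```latex
% r: floating-point operations per global memory access (the ratio)
% s: bytes fetched per access
% B: global memory bandwidth in bytes per second
\[
  \text{attainable FLOP/s} \;\le\; r \cdot \frac{B}{s}
\]
% Illustrative numbers (assumed): B = 1000 GB/s, s = 4 bytes (one float)
\[
  r = 1:\quad 1 \cdot \frac{1000\ \text{GB/s}}{4\ \text{B}} = 250\ \text{GFLOP/s},
  \qquad
  r = 4:\quad 4 \cdot \frac{1000\ \text{GB/s}}{4\ \text{B}} = 1000\ \text{GFLOP/s}
\]
```

Under these assumed numbers, quadrupling the ratio quadruples the ceiling, which is why reusing data once it has been fetched pays off.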

The book motivates this with matrix multiplication.

__global__ void MatrixMulKernel(float* M, float* N, float* P,
                                int Width) {
  // Calculate the row index of the P element and M
  int Row = blockIdx.y*blockDim.y+threadIdx.y;
  // Calculate the column index of P and N
  int Col = blockIdx.x*blockDim.x+threadIdx.x;
  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k) {
      Pvalue += M[Row*Width+k]*N[k*Width+Col];
    }
    P[Row*Width+Col] = Pvalue;
  }
}
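To try the kernel, something along these lines could launch it and compare the result against a plain CPU loop. This is a minimal sketch, not code from the book: the helper name `maxAbsError`, the 16x16 block shape, and the input data are all assumptions made here for illustration.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Kernel from the notes: one thread computes one element of P.
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
  int Row = blockIdx.y * blockDim.y + threadIdx.y;
  int Col = blockIdx.x * blockDim.x + threadIdx.x;
  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
      Pvalue += M[Row * Width + k] * N[k * Width + Col];
    }
    P[Row * Width + Col] = Pvalue;
  }
}

// Hypothetical helper (not from the notes): runs the kernel on two
// Width x Width matrices and returns the largest absolute difference
// from a CPU reference, or -1 if any CUDA call fails.
float maxAbsError(int Width) {
  assert(Width > 0);
  size_t count = (size_t)Width * Width;
  size_t bytes = count * sizeof(float);

  // Small, exactly representable inputs keep rounding noise out of the check.
  std::vector<float> hM(count), hN(count), hP(count);
  for (size_t i = 0; i < count; ++i) {
    hM[i] = (float)(i % 7) * 0.5f;
    hN[i] = (float)(i % 5) - 2.0f;
  }

  float *dM = nullptr, *dN = nullptr, *dP = nullptr;
  bool ok = cudaMalloc(&dM, bytes) == cudaSuccess &&
            cudaMalloc(&dN, bytes) == cudaSuccess &&
            cudaMalloc(&dP, bytes) == cudaSuccess &&
            cudaMemcpy(dM, hM.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
            cudaMemcpy(dN, hN.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

  if (ok) {
    // Assumed launch shape: 16x16 threads per block, grid rounded up
    // so that every element of P is covered.
    dim3 block(16, 16);
    dim3 grid((Width + block.x - 1) / block.x,
              (Width + block.y - 1) / block.y);
    MatrixMulKernel<<<grid, block>>>(dM, dN, dP, Width);
    // The device-to-host copy waits for the kernel to finish.
    ok = cudaGetLastError() == cudaSuccess &&
         cudaMemcpy(hP.data(), dP, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
  }

  cudaFree(dM);
  cudaFree(dN);
  cudaFree(dP);
  if (!ok) return -1.0f;

  // CPU reference using the same row-major indexing as the kernel.
  float maxErr = 0.0f;
  for (int r = 0; r < Width; ++r) {
    for (int c = 0; c < Width; ++c) {
      float ref = 0.0f;
      for (int k = 0; k < Width; ++k) {
        ref += hM[r * Width + k] * hN[k * Width + c];
      }
      maxErr = std::fmax(maxErr, std::fabs(ref - hP[r * Width + c]));
    }
  }
  return maxErr;
}
```

Compiled with `nvcc`, calling `maxAbsError(64)` from a `main` should return a value at or near zero on a working GPU; a negative value signals a CUDA error.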

The kernel above is slow.

Every access to M[...] or N[...] inside the loop is a read from global memory, and the final store to P[...] is a write to global memory.

EXERCISE: Calculate the compute-to-memory-access ratio of the inner loop. It is 1.0
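One way to arrive at that number is to count what a single iteration of the inner loop does. The bytes-based figure in the second line assumes 4-byte float operands, matching the float* parameters of the kernel.

```latex
% One iteration: Pvalue += M[Row*Width+k] * N[k*Width+Col];
\[
  \text{ratio}
  = \frac{1\ \text{multiply} + 1\ \text{add}}
         {1\ \text{load from } M + 1\ \text{load from } N}
  = \frac{2\ \text{FLOP}}{2\ \text{accesses}}
  = 1.0
\]
% The same count expressed per byte, with 4 bytes per float
\[
  \frac{2\ \text{FLOP}}{2 \times 4\ \text{B}} = 0.25\ \text{FLOP/B}
\]
```

The single store to P happens once per thread, outside the loop, so it does not change the per-iteration count.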

Fundamental question to answer

Is memory access done per thread, per block, or per warp?

It seems to be done on a per-warp basis. More on this is covered in Chapter 5.

🛠️ Steven Gong
