# Memory-Bound Program

Notes copied from Chapter 4 of PMPP.


Memory-bound programs are programs whose execution speed is limited by memory access throughput.

One of the fundamental concepts is the **compute-to-global-memory-access ratio**: the number of floating-point operations performed per access to global memory.

Memory is expensive

Compute is a lot faster than memory access, so you want the compute-to-global-memory-access ratio to be as high as possible.
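A rough way to see why a higher ratio helps is the bound sketched below. It is an illustration under assumptions, not a figure from the chapter: $B$ is the global memory bandwidth in bytes per second, every access moves one 4-byte single-precision value, and $r$ is the compute-to-global-memory-access ratio. The value $B = 1000$ GB/s is hypothetical, chosen only to keep the arithmetic simple.

```latex
\text{attainable FLOP/s} \;\le\; r \cdot \frac{B}{4\ \text{bytes}}

% Hypothetical numbers: B = 1000 GB/s
r = 1:\quad 1 \cdot \frac{1000 \times 10^{9}}{4} = 250 \times 10^{9}\ \text{FLOP/s}
\qquad
r = 4:\quad 4 \cdot \frac{1000 \times 10^{9}}{4} = 1000 \times 10^{9}\ \text{FLOP/s}
```

Under these assumptions, the bound grows linearly with $r$ until it reaches the hardware's peak compute rate.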

The book motivates this through matrix multiplication.

```
__global__ void MatrixMulKernel(float* M, float* N, float* P,
                                int Width) {
    // Calculate the row index of the P element and M
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    // Calculate the column index of P and N
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row * Width + k] * N[k * Width + Col];
        }
        P[Row * Width + Col] = Pvalue;
    }
}
```

The above is slow

Every access to `M[...]`, `N[...]`, or `P[...]` goes to global memory: `M` and `N` are read on every loop iteration, and `P` is written once at the end.

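EXERCISE: Calculate the compute-to-memory-access ratio. It is 1.0: each iteration of the inner loop performs two floating-point operations (one multiply, one add) and two global memory reads (one element of `M`, one element of `N`).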
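The exercise can be checked by tallying what a single in-range thread of `MatrixMulKernel` does. The sketch below is host-only C++ (no GPU needed) and touches no real matrix data. The names `AccessCounts`, `countOneThread`, and `loopRatio` are hypothetical helpers made up for these notes, and the sketch assumes `width > 0`.

```cpp
#include <cstddef>

// Tallies for one thread of the naive kernel (hypothetical helper, not from PMPP).
struct AccessCounts {
    std::size_t flops;          // floating-point multiplies + adds
    std::size_t global_loads;   // reads of M and N
    std::size_t global_stores;  // writes of P
};

// Walks the same loop as one in-range thread of MatrixMulKernel and counts
// what that thread would do, without reading or writing any matrix.
AccessCounts countOneThread(int width) {
    AccessCounts c{0, 0, 0};
    for (int k = 0; k < width; ++k) {
        c.global_loads += 2;  // M[Row*Width+k] and N[k*Width+Col]
        c.flops += 2;         // one multiply, one add into Pvalue
    }
    c.global_stores += 1;     // P[Row*Width+Col] = Pvalue
    return c;
}

// Compute-to-global-memory-access ratio of the inner loop: FLOPs per load.
// Assumes at least one load was counted (width > 0).
double loopRatio(const AccessCounts& c) {
    return static_cast<double>(c.flops) / static_cast<double>(c.global_loads);
}
```

For any positive `width`, `loopRatio(countOneThread(width))` is 1.0, because the loads and FLOPs both grow as `2 * width`. The single store to `P` sits outside the loop and becomes negligible as `width` grows.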

Fundamental question to answer

Is memory access done per thread, per block, or per warp?

- It seems to be done on a per-warp basis. More on this is covered in Chapter 5.