Matrix Multiplication

Matrix Multiplication (Compute)

Need to really fundamentally understand this from a compute perspective.

This is one of the most fundamental building blocks for fast computation.

Matrix multiplication resources

Wow, this is actually kind of complicated.

The most simple implementation

template <int rows, int columns, int inners>
inline void matmulImplNaive(const float *left, const float *right,
                            float *result) {
  for (int row = 0; row < rows; row++) {
    for (int col = 0; col < columns; col++) {
      for (int inner = 0; inner < inners; inner++) {
        result[row * columns + col] +=
            left[row * columns + inner] * right[inner * columns + col];
      } 
    } 
  }
}

Matrix Multiplication via Tiling

Instead of processing one element at a time tiling divides the matrices into smaller submatrices (tiles) of size . This allows better use of fast cache memory by reducing the need to frequently fetch data from slower RAM.

We divide and into square blocks (tiles) of size . Then, we compute the matrix product block by block.

Assume:

  • and are matrices.
  • Tile size

We split and like this:

A_{00} & A_{01} \\ A_{10} & A_{11} \\ \end{bmatrix}, \quad B = \begin{bmatrix} B_{00} & B_{01} \\ B_{10} & B_{11} \\ \end{bmatrix}

Each and is a tile.

We compute like this:

This method allows each block to load a ( T \times T ) chunk into memory at once, improving:

  • Cache efficiency (data stays in cache longer)
  • Parallelism (different tiles can be computed simultaneously on separate threads/cores)

Why does this help?

  • Memory is hierarchical: fetching from cache is much faster than from RAM.
  • Tiling minimizes cache misses by maximizing the reuse of loaded data.
  • GPUs and SIMD (Single Instruction, Multiple Data) instructions are designed to process blocks of data in parallel.

In practice

for i in range(0, N, T):  # Loop over tiles of A
    for j in range(0, N, T):  # Loop over tiles of B
        for k in range(0, N, T):  # Loop over the depth dimension
            # Multiply tile A[i:i+T, k:k+T] with tile B[k:k+T, j:j+T]
            for ii in range(i, i+T):
                for jj in range(j, j+T):
                    for kk in range(k, k+T):
                        C[ii][jj] += A[ii][kk] * B[kk][jj]