Matrix Multiplication (Compute)
Need to really fundamentally understand this from a compute perspective.
This is one of the most fundamental building blocks for fast computation.
Matrix multiplication resources
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory
- https://siboehm.com/articles/22/Fast-MMM-on-CPU (CPU level multiplication)
- https://siboehm.com/articles/22/CUDA-MMM (GPU level multiplication)
Wow, this is actually kind of complicated.
The simplest implementation
template <int rows, int columns, int inners>
inline void matmulImplNaive(const float *left, const float *right,
                            float *result) {
  // left is rows x inners, right is inners x columns, result is rows x columns.
  // result must be zero-initialized by the caller.
  for (int row = 0; row < rows; row++) {
    for (int col = 0; col < columns; col++) {
      for (int inner = 0; inner < inners; inner++) {
        result[row * columns + col] +=
            left[row * inners + inner] * right[inner * columns + col];
      }
    }
  }
}
Matrix Multiplication via Tiling
Instead of processing one element at a time, tiling divides the matrices into smaller submatrices (tiles) of size T × T. This makes better use of fast cache memory by reducing how often data must be fetched from slower RAM.
We divide A and B into square blocks (tiles) of size T × T, then compute the matrix product block by block.
Assume:
- A and B are N × N matrices.
- Tile size T, with T dividing N evenly.
We split A and B like this:
$$A = \begin{bmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{bmatrix}, \quad B = \begin{bmatrix} B_{00} & B_{01} \\ B_{10} & B_{11} \end{bmatrix}$$
Each $A_{ij}$ and $B_{ij}$ is a T × T tile.
We compute C like this:
$$C_{ij} = \sum_{k} A_{ik} B_{kj}$$
where the sum runs over the tile index $k$.
This method allows each block to load a T × T chunk into cache at once, improving:
- Cache efficiency (data stays in cache longer)
- Parallelism (different tiles can be computed simultaneously on separate threads/cores)
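Concretely, for the 2×2 block split above, each block of C is a sum of block products:

```latex
\begin{aligned}
C_{00} &= A_{00}B_{00} + A_{01}B_{10}, & C_{01} &= A_{00}B_{01} + A_{01}B_{11},\\
C_{10} &= A_{10}B_{00} + A_{11}B_{10}, & C_{11} &= A_{10}B_{01} + A_{11}B_{11}.
\end{aligned}
```

Each block product is itself an ordinary T × T matrix multiplication, small enough to fit in cache.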
Why does this help?
- Memory is hierarchical: fetching from cache is much faster than from RAM.
- Tiling minimizes cache misses by maximizing the reuse of loaded data.
- GPUs and SIMD (Single Instruction, Multiple Data) instructions are designed to process blocks of data in parallel.
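A back-of-the-envelope way to quantify the reuse (my arithmetic, not from the linked articles): for each tile-level iteration, the inner three loops perform about $2T^3$ floating-point operations while touching only $2T^2$ input elements, so every element brought into cache is reused roughly $T$ times:

```latex
\frac{\text{FLOPs}}{\text{elements loaded}} \approx \frac{2T^3}{2T^2} = T
```

Larger tiles mean more reuse per load, up to the point where a tile no longer fits in cache.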
In practice
# C must start as all zeros; assumes T divides N evenly.
for i in range(0, N, T):          # Loop over row tiles of A
    for j in range(0, N, T):      # Loop over column tiles of B
        for k in range(0, N, T):  # Loop over the shared (inner) dimension
            # Multiply tile A[i:i+T, k:k+T] with tile B[k:k+T, j:j+T]
            for ii in range(i, i+T):
                for jj in range(j, j+T):
                    for kk in range(k, k+T):
                        C[ii][jj] += A[ii][kk] * B[kk][jj]