Matrix Multiplication (Compute)
You need a solid understanding of matrix multiplication from a computational perspective: it is the fundamental building block behind all of deep learning.
Matrix multiplication resources
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory
- https://siboehm.com/articles/22/Fast-MMM-on-CPU (CPU level multiplication)
- https://siboehm.com/articles/22/CUDA-MMM (GPU level multiplication)
Matrix multiplication is done in $O(n^3)$ time. You need 3 for loops: 2 to index entry $(i, j)$ of the destination matrix, and one to compute the dot product between row $i$ of the left matrix and column $j$ of the right matrix.
- See Matrix Multiplication for a refresher on the basics
The simplest implementation:
template <int rows, int columns, int inners>
inline void matmulImplNaive(const float *left, const float *right,
                            float *result) {
  // left is rows x inners, right is inners x columns, result is rows x columns,
  // all row-major. result must be zero-initialized since we accumulate into it.
  for (int row = 0; row < rows; row++) {
    for (int col = 0; col < columns; col++) {
      for (int inner = 0; inner < inners; inner++) {
        result[row * columns + col] +=
            left[row * inners + inner] * right[inner * columns + col];
      }
    }
  }
}
The loop order matters for large matrices because of cache misses.
Remember that matrices are laid out row by row (row-major) in memory (see Linearized Array).
This means that loading a row is efficient, but loading a column requires many separate loads, because consecutive column elements are not contiguous in memory.
Consider the first pass of the iteration, and suppose each cache line can hold only 4 elements. Walking down an entire column, every new element you read induces a Cache Miss and requires a fresh memory load into the cache.
This isn't actually too bad for small matrices, where the matrix fits entirely in cache: after the initial cache misses, the whole matrix stays resident.
However, for matrices much larger than the cache, lines get evicted before they can be reused, so you keep paying the miss penalty on every pass.
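To make the access-pattern difference concrete, here is a small illustrative sketch (the function names and the use of std::vector are my own choices, not code from the articles above): summing a row-major matrix row by row touches memory contiguously, while summing it column by column jumps a full row length on every access, which is exactly what triggers the extra cache misses.

#include <vector>

// Sum an N x N row-major matrix two ways; both return the same value,
// but the access patterns interact very differently with the cache.
float sumByRows(const std::vector<float> &m, int N) {
  float total = 0.0f;
  for (int row = 0; row < N; row++)
    for (int col = 0; col < N; col++)
      total += m[row * N + col];  // contiguous: walks along cache lines
  return total;
}

float sumByCols(const std::vector<float> &m, int N) {
  float total = 0.0f;
  for (int col = 0; col < N; col++)
    for (int row = 0; row < N; row++)
      total += m[row * N + col];  // strided: jumps N floats per access
  return total;
}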
In General
You really want to minimize the amount of time that you are spending reading / writing from main memory. You want to maximize Spatial Locality of data.
- This means that if your data is in cache, try to read all of it.
A really simple but effective improvement is to reorder the loops:
template <int rows, int columns, int inners>
inline void matmulImplLoopOrder(const float *left, const float *right,
                                float *result) {
  // Same computation as above, but with the col and inner loops swapped:
  // the innermost loop now walks result and right contiguously.
  for (int row = 0; row < rows; row++) {
    for (int inner = 0; inner < inners; inner++) {
      for (int col = 0; col < columns; col++) {
        result[row * columns + col] +=
            left[row * inners + inner] * right[inner * columns + col];
      }
    }
  }
}
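A minimal usage sketch (the size 256 and the buffer names are arbitrary choices for illustration): the kernels accumulate into result, so it must start out zeroed.

#include <vector>

int main() {
  constexpr int N = 256;  // arbitrary example size
  std::vector<float> A(N * N, 1.0f), B(N * N, 1.0f);
  std::vector<float> C(N * N, 0.0f);  // must be zero: the kernel accumulates into it

  matmulImplLoopOrder<N, N, N>(A.data(), B.data(), C.data());
  // C[i * N + j] now holds the dot product of row i of A and column j of B
  // (here every entry equals N, since A and B are all ones).
}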
Matrix Multiplication via Tiling
Tiling is the next optimization technique that we will look at. Again, we try to maximize data locality.
Instead of processing one element at a time, tiling divides the matrices into smaller sub-matrices (tiles) of size $T \times T$. This allows better use of fast cache memory by reducing the need to frequently fetch data from slower RAM.
We divide $A$ and $B$ into square blocks (tiles) of size $T \times T$. Then, we compute the matrix product block by block.
Assume:
- $A$ and $B$ are $N \times N$ matrices.
- Tile size $T$ (for simplicity, $N$ is a multiple of $T$).
We split $A$ and $B$ like this:
$$
A = \begin{bmatrix} A_{11} & A_{12} & \cdots \\ A_{21} & A_{22} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix},
\qquad
B = \begin{bmatrix} B_{11} & B_{12} & \cdots \\ B_{21} & B_{22} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}
$$
Each $A_{ij}$ and $B_{ij}$ is a $T \times T$ tile.
We compute $C$ like this:
$$
C_{ij} = \sum_{k} A_{ik} B_{kj}
$$
This method lets each block computation load a $T \times T$ chunk into cache at once and reuse it, improving:
- Cache efficiency (data stays in cache longer)
- Parallelism (different tiles can be computed simultaneously on separate threads/cores)
Why does this help?
- Memory is hierarchical: fetching from cache is much faster than from RAM.
- Tiling minimizes cache misses by maximizing the reuse of loaded data (see the rough traffic estimate after this list).
- GPUs and SIMD (Single Instruction, Multiple Data) instructions are designed to process blocks of data in parallel.
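As a rough back-of-the-envelope estimate (my own sketch, assuming an idealized cache in which three $T \times T$ tiles fit at once, and ignoring cache-line effects): the naive kernel streams roughly $2N$ elements from RAM per output element, while the tiled kernel loads about three tiles per tile-level multiply:
$$
\text{naive traffic} \approx 2N \cdot N^2 = 2N^3
\qquad\text{vs.}\qquad
\text{tiled traffic} \approx \left(\frac{N}{T}\right)^3 \cdot 3T^2 = \frac{3N^3}{T}
$$
So tiling cuts main-memory traffic by a factor on the order of $T$, which is why larger tiles help right up until the working set of tiles no longer fits in cache.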
In practice
# Assumes A, B, C are N x N, C is initialized to zero,
# and N is a multiple of the tile size T.
for i in range(0, N, T):          # tile row of C (rows of A)
    for j in range(0, N, T):      # tile column of C (columns of B)
        for k in range(0, N, T):  # inner / depth dimension
            # Multiply tile A[i:i+T, k:k+T] with tile B[k:k+T, j:j+T]
            # and accumulate into tile C[i:i+T, j:j+T]
            for ii in range(i, i+T):
                for jj in range(j, j+T):
                    for kk in range(k, k+T):
                        C[ii][jj] += A[ii][kk] * B[kk][jj]
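The same idea as a C++ sketch in the style of the earlier template kernels (my own illustration, not code from the linked articles); it assumes all dimensions are multiples of tileSize and that result is zero-initialized. The innermost loops use the row / inner / col order from the loop-reordering section so the hot loop stays contiguous.

template <int rows, int columns, int inners, int tileSize>
inline void matmulImplTiled(const float *left, const float *right,
                            float *result) {
  // Assumes rows, columns and inners are multiples of tileSize,
  // and that result is zero-initialized.
  for (int rowTile = 0; rowTile < rows; rowTile += tileSize) {
    for (int colTile = 0; colTile < columns; colTile += tileSize) {
      for (int innerTile = 0; innerTile < inners; innerTile += tileSize) {
        // Multiply one tileSize x tileSize tile of left with one of right,
        // accumulating into the corresponding tile of result.
        for (int row = rowTile; row < rowTile + tileSize; row++) {
          for (int inner = innerTile; inner < innerTile + tileSize; inner++) {
            for (int col = colTile; col < colTile + tileSize; col++) {
              result[row * columns + col] +=
                  left[row * inners + inner] * right[inner * columns + col];
            }
          }
        }
      }
    }
  }
}

Because each (rowTile, colTile) pair owns a disjoint tile of result, the two outer loops can also be split across threads (for example with OpenMP's `#pragma omp parallel for collapse(2)`) without any synchronization, which is the parallelism benefit mentioned above.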
The original set of excalidraw drawings