Matrix Multiplication (Compute)
Goal: understand matrix multiplication thoroughly from a compute perspective.
It is one of the most fundamental building blocks for fast computation.
Matrix multiplication resources
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory
- https://siboehm.com/articles/22/CUDA-MMM
Wow, this is actually kind of complicated.
The simplest implementation
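As a baseline to reason about, here is a minimal CPU sketch of the textbook triple loop. It is not taken from either resource above. The function name `matmul_naive`, the row-major layout, and the use of `float` are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Naive matrix multiply, C = A * B (hypothetical helper for these notes).
// Assumes row-major storage: A is M x K, B is K x N, C is M x N.
std::vector<float> matmul_naive(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t M, std::size_t K, std::size_t N) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i) {          // row of C
        for (std::size_t j = 0; j < N; ++j) {      // column of C
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {  // dot product over the shared dimension
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
    return C;
}
```

From a compute perspective this does M * N * K multiply-adds, so roughly 2 * M * N * K floating-point operations. The inner loop walks B with a stride of N elements, which is unfriendly to caches. In a GPU version, the two outer loops would typically be replaced by one thread per element of C.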