Basic Linear Algebra Subprograms (BLAS)

Running into this as I try to implement my own Accelerated Eigen.

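This is fire: https://siboehm.com/articles/22/CUDA-MMM (a worklog on optimizing a CUDA matrix multiplication kernel step by step, using cuBLAS as the baseline to beat).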

NVIDIA has a GPU implementation of BLAS called cuBLAS.

BLAS routines are grouped into 3 levels, by the kind of operands they work on:

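Level 1 BLAS: Vector-vector operations, e.g. dot product and axpy (y = alpha*x + y). O(n) work.

Level 2 BLAS: Matrix-vector operations, e.g. gemv (y = alpha*A*x + beta*y). O(n^2) work.

Level 3 BLAS: Matrix-matrix operations, e.g. gemm (C = alpha*A*B + beta*C). O(n^3) work.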
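
To make the three levels concrete, here is a minimal, unoptimized C++ sketch with one representative operation per level. The function names (`naive_axpy`, `naive_gemv`, `naive_gemm`) are made up for these notes and are not the real BLAS or cuBLAS API. I am also assuming row-major storage in a flat `std::vector<double>`, whereas reference BLAS is column-major.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Level 1 (vector-vector): y <- alpha * x + y.
// Hypothetical helper, not a real BLAS entry point.
void naive_axpy(double alpha, const std::vector<double>& x,
                std::vector<double>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] += alpha * x[i];
    }
}

// Level 2 (matrix-vector): y <- alpha * A * x + beta * y.
// A is m x n, stored row-major (an assumption of this sketch).
void naive_gemv(std::size_t m, std::size_t n, double alpha,
                const std::vector<double>& A, const std::vector<double>& x,
                double beta, std::vector<double>& y) {
    assert(A.size() == m * n && x.size() == n && y.size() == m);
    for (std::size_t i = 0; i < m; ++i) {
        double acc = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            acc += A[i * n + j] * x[j];
        }
        y[i] = alpha * acc + beta * y[i];
    }
}

// Level 3 (matrix-matrix): C <- alpha * A * B + beta * C.
// A is m x k, B is k x n, C is m x n, all row-major.
void naive_gemm(std::size_t m, std::size_t n, std::size_t k, double alpha,
                const std::vector<double>& A, const std::vector<double>& B,
                double beta, std::vector<double>& C) {
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

The loop nesting is the thing to notice: one loop for Level 1, two for Level 2, three for Level 3. Level 3 does O(n^3) arithmetic on O(n^2) data, so it has the most data reuse to exploit, and that is where tuned libraries (and the CUDA article above) spend most of their effort.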