Basic Linear Algebra Subprograms (BLAS)
Running into this as I try to implement my own Accelerated Eigen.
This is fire: https://siboehm.com/articles/22/CUDA-MMM
NVIDIA has a GPU implementation called cuBLAS.
BLAS routines are grouped into 3 levels:
Level 1 BLAS: Vector-vector operations (e.g. axpy, dot).
Level 2 BLAS: Matrix-vector operations (e.g. gemv).
Level 3 BLAS: Matrix-matrix operations (e.g. gemm).
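
To make the three levels concrete, here is a minimal, unoptimized sketch with one routine per level. The names axpy, gemv and gemm follow BLAS naming, but the signatures are simplified and hypothetical: they are not the real BLAS or cuBLAS interfaces. It also assumes row-major storage in a flat std::vector, whereas reference BLAS is column-major.

```cpp
#include <cstddef>
#include <vector>

// Level 1 (axpy): y <- alpha * x + y. Assumes x and y have the same length.
void axpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] += alpha * x[i];
    }
}

// Level 2 (gemv): y <- alpha * A * x + beta * y.
// A is m x n, stored row-major (an assumption of this sketch).
void gemv(std::size_t m, std::size_t n, float alpha,
          const std::vector<float>& A, const std::vector<float>& x,
          float beta, std::vector<float>& y) {
    for (std::size_t i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < n; ++j) {
            acc += A[i * n + j] * x[j];
        }
        y[i] = alpha * acc + beta * y[i];
    }
}

// Level 3 (gemm): C <- alpha * A * B + beta * C.
// A is m x k, B is k x n, C is m x n, all row-major.
void gemm(std::size_t m, std::size_t n, std::size_t k, float alpha,
          const std::vector<float>& A, const std::vector<float>& B,
          float beta, std::vector<float>& C) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

The loop nesting depth matches the level: one loop for Level 1, two for Level 2, three for Level 3. Level 3 does O(n^3) arithmetic on O(n^2) data, which is why gemm benefits most from tiling and caching tricks like the ones in the article linked above.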