Corner Turning
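Remember that matrices are generally stored in Row-Major Layout. In mat-mat multiplication each thread reads the second matrix column-wise, so what matters for speed is whether consecutive threads in a warp touch consecutive addresses when they load it. When the second matrix is stored in column-major layout and adjacent threads take adjacent columns, their addresses are a whole column apart.
- This means that no Memory Coalescing can be done, so these loads are much slower
The solution? Swap the roles of the thread indices when loading each tile of the second matrix into shared memory, so that consecutive threads read consecutive addresses, and then read the tile from shared memory in the order the computation needs. This technique is called corner turning.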
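
To see the coalescing effect concretely, here is a small Python model (not GPU code) of how one warp of 32 threads loads part of a tile from a column-major second matrix. The function names, the matrix size `N`, and the 128-byte transaction size are assumptions made for illustration only. It compares the usual mapping, where `tx` selects the column, with the swapped mapping, where `tx` walks down the column, and counts how many memory segments each warp load touches.

```python
WARP = 32             # threads per warp
TILE = 32             # tile width, set equal to the warp size for simplicity
N = 1024              # matrix dimension (hypothetical)
ELEM_BYTES = 4        # float32
SEGMENT_BYTES = 128   # assumed size of one coalesced memory transaction


def col_major_index(k, col, n=N):
    """Linear index of B[k][col] when B is stored column by column."""
    return col * n + k


def segments_touched(indices):
    """Count the distinct SEGMENT_BYTES-sized chunks covering these elements."""
    return len({(i * ELEM_BYTES) // SEGMENT_BYTES for i in indices})


def naive_warp_load(ty, tile=0, block_col=0):
    # tx picks the column, ty picks the row k:
    # neighbouring threads are N elements apart in memory.
    return [col_major_index(tile * TILE + ty, block_col * TILE + tx)
            for tx in range(WARP)]


def corner_turned_warp_load(ty, tile=0, block_col=0):
    # Roles swapped: tx picks the row k, ty picks the column:
    # neighbouring threads read adjacent elements.
    return [col_major_index(tile * TILE + tx, block_col * TILE + ty)
            for tx in range(WARP)]


naive = segments_touched(naive_warp_load(ty=0))
turned = segments_touched(corner_turned_warp_load(ty=0))
print(naive, turned)  # prints: 32 1
```

Under these assumptions the swapped mapping needs one transaction per warp load instead of 32, while both mappings still fetch the same set of tile elements overall. In a real kernel the loaded tile would be written to shared memory and then read in the order the dot product needs.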