CUDA Optimization

Corner Turning

Remember that matrices are generally stored in row-major layout. This is inefficient for matrix-matrix multiplication: the second matrix is also stored in row-major order but is accessed column-wise, so consecutive accesses to it are a full row apart in memory.
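To make the column-wise access pattern concrete, here is a minimal host-side C++ sketch of the index arithmetic. The helper names `row_major_index` and `column_walk_offsets` are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat offset of element (row, col) in a row-major matrix with num_cols columns.
// Hypothetical helper, named for illustration only.
inline std::size_t row_major_index(std::size_t row, std::size_t col,
                                   std::size_t num_cols) {
    assert(col < num_cols);
    return row * num_cols + col;
}

// Offsets touched when walking down one column of a row-major matrix.
// This is the walk the inner product performs over the second matrix.
inline std::vector<std::size_t> column_walk_offsets(std::size_t num_rows,
                                                    std::size_t num_cols,
                                                    std::size_t col) {
    std::vector<std::size_t> offsets;
    offsets.reserve(num_rows);
    for (std::size_t k = 0; k < num_rows; ++k) {
        offsets.push_back(row_major_index(k, col, num_cols));
    }
    return offsets;
}
```

For a 3 x 4 matrix, `column_walk_offsets(3, 4, 1)` gives the offsets 1, 5, 9: each step jumps by `num_cols` elements rather than by one.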

The solution? Store the second input matrix in column-major layout, so that each of its columns is contiguous in memory. This technique is called corner turning.
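The layout change can be sketched on the host as follows. This is a minimal illustration under stated assumptions, not a tuned kernel: the function names `to_column_major` and `matmul_b_column_major` are hypothetical, and the matrices are assumed to be flat `float` arrays with A of size M x K and B of size K x N.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Re-lay out a row-major K x N matrix so that each column is contiguous
// (column-major): element (k, col) moves to offset col * K + k.
inline std::vector<float> to_column_major(const std::vector<float>& b_row_major,
                                          std::size_t K, std::size_t N) {
    assert(b_row_major.size() == K * N);
    std::vector<float> b_col_major(K * N);
    for (std::size_t k = 0; k < K; ++k) {
        for (std::size_t col = 0; col < N; ++col) {
            b_col_major[col * K + k] = b_row_major[k * N + col];
        }
    }
    return b_col_major;
}

// C (M x N, row-major) = A (M x K, row-major) * B (K x N, column-major).
inline std::vector<float> matmul_b_column_major(const std::vector<float>& a,
                                                const std::vector<float>& b_col_major,
                                                std::size_t M, std::size_t K,
                                                std::size_t N) {
    assert(a.size() == M * K && b_col_major.size() == K * N);
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t row = 0; row < M; ++row) {
        for (std::size_t col = 0; col < N; ++col) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                // Both operands are now read with stride 1 as k advances.
                sum += a[row * K + k] * b_col_major[col * K + k];
            }
            c[row * N + col] = sum;
        }
    }
    return c;
}
```

In a CUDA kernel the same index expressions apply, though whether the layout pays off also depends on how threads are mapped to output elements, since neighboring threads in a warp should touch neighboring addresses for global memory accesses to coalesce.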