Modular optimizes matrix multiplication on Nvidia Blackwell GPUs: 50x performance boost using shared memory and loop tiling

Modular: Matrix Multiplication on Blackwell: Part 2 - Using Hardware Features to Optimize Matmul

In the first blog post in this series, we explained Nvidia's Blackwell GPU architecture and concluded with a 4-line kernel that was a bit worse than cuBLAS. In fact, the performance was a lot worse, coming in at 0.3% of cuBLAS and leaving 1758 TFLOPS on the table. In this post we continue our journey and improve performance by more than 50x over our initial kernel benchmark. Along the way we explain more GPU programming concepts and leverage novel Blackwell features. Note t...