News Score: Score the News, Sort the News, Rewrite the Headlines

Modular: Matrix Multiplication on Blackwell: Part 2 - Using Hardware Features to Optimize Matmul

In the first blog post in this series, we explained Nvidia's Blackwell GPU architecture and concluded with a 4-line kernel that was a bit worse than cuBLAS. In fact, the performance was a lot worse, coming in at 0.3% of cuBLAS and leaving 1758 TFLOPS on the table. In this post we are going to continue our journey and improve our performance by more than 50x over our initial kernel benchmark. Along the way we are going to explain more GPU programming concepts and leverage novel Blackwell features. Note t...

Read more at modular.com
