News Score: Score the News, Sort the News, Rewrite the Headlines

Strangely, Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data! [short]

It’s 2022. I check out this cool new project, CUTLASS, with very fast matmuls. I take a large matmul, 8192 x 8192 x 8192, and benchmark it in PyTorch, which calls CuBLAS.

    python mm_bench.py
    > CuBLAS: 258 Teraflops

Not bad, 83% flop utilization. Now let’s check out Cutlass’s performance using their profiler.

    ./cutlass_profiler --operation=Gemm --m=8192 --n=8192 --k=8192
    > CUTLASS: 288 Teraflops

!!! 10% higher perf? That’s incredible. CuBLAS is highly optimized for large compute-bound matmuls, and so...
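To make the numbers in the excerpt concrete, here is a minimal sketch of the arithmetic behind them. It assumes the usual convention of counting a multiply-add as 2 flops, and a peak of 312 Teraflops for the GPU. The excerpt does not name the GPU or its peak; 312 is a guess chosen only because it is consistent with the quoted 83% utilization. The helper names `matmul_flops` and `utilization` are illustrative and are not taken from the original post.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """Flops for an (m x k) @ (k x n) matmul, counting each multiply-add as 2."""
    return 2 * m * n * k


def utilization(achieved_tflops: float, peak_tflops: float) -> float:
    """Fraction of peak throughput actually achieved."""
    return achieved_tflops / peak_tflops


# ASSUMPTION: the excerpt does not state the GPU's peak throughput.
# 312 Teraflops is a guess that matches the quoted "83% flop utilization".
ASSUMED_PEAK_TFLOPS = 312.0

# Throughput figures quoted in the excerpt.
cublas_tflops = 258.0
cutlass_tflops = 288.0

flops = matmul_flops(8192, 8192, 8192)
cublas_ms = flops / (cublas_tflops * 1e12) * 1e3
speedup = cutlass_tflops / cublas_tflops

print(f"work per matmul: {flops / 1e12:.2f} Tflop")
print(f"CuBLAS time per matmul: {cublas_ms:.2f} ms")
print(f"CuBLAS utilization: {utilization(cublas_tflops, ASSUMED_PEAK_TFLOPS):.0%}")
print(f"CUTLASS / CuBLAS: {speedup:.2f}x")
```

Under these assumptions, one 8192 x 8192 x 8192 matmul is about 1.1 Tflop of work, so a benchmark at this size measures runs of only a few milliseconds each.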

Read more at thonking.ai
