GPU matrix multiplication speeds vary up to 15% based on input data values due to dynamic power consumption affecting clock throttling on Nvidia A100s, researcher discovers in 2022.

Strangely, Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data! [short]

It’s 2022. I check out this cool new project, CUTLASS, with very fast matmuls. I take a large matmul, 8192 x 8192 x 8192, and benchmark it in PyTorch, which calls CuBLAS.

    python mm_bench.py
    > CuBLAS: 258 Teraflops

Not bad, 83% flop utilization. Now let’s check out Cutlass’s performance using their profiler.

    ./cutlass_profiler --operation=Gemm --m=8192 --n=8192 --k=8192
    > CUTLASS: 288 Teraflops

!!! 10% higher perf? That’s incredible. CuBLAS is highly optimized for large compute-bound matmuls, and so...
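For readers who want to sanity-check the utilization figure quoted above, here is a rough back-of-the-envelope sketch. It assumes the GPU is an A100 with a datasheet peak of 312 TFLOPS for dense FP16/BF16 tensor-core math; the post only states the 83% number, so that peak is my assumption. The helper names are made up for illustration and are not taken from `mm_bench.py`.

```python
# Back-of-the-envelope check of the throughput numbers quoted in the post.

# ASSUMPTION: 312 TFLOPS is the A100 datasheet peak for dense FP16/BF16
# tensor-core math. The post itself only states the 83% utilization figure.
A100_PEAK_TFLOPS = 312.0


def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Convert a measured wall-clock time for one matmul into achieved TFLOPS."""
    return matmul_flops(m, n, k) / seconds / 1e12


def utilization(tflops: float, peak_tflops: float = A100_PEAK_TFLOPS) -> float:
    """Fraction of the assumed hardware peak that was reached."""
    return tflops / peak_tflops


if __name__ == "__main__":
    cublas, cutlass = 258.0, 288.0  # TFLOPS figures reported in the post
    print(f"FLOPs per matmul:    {matmul_flops(8192, 8192, 8192):.3e}")
    print(f"CuBLAS utilization:  {utilization(cublas):.1%}")
    print(f"CUTLASS utilization: {utilization(cutlass):.1%}")
    print(f"CUTLASS / CuBLAS:    {cutlass / cublas:.3f}x")
```

Under that assumed peak, 258 Teraflops comes out to roughly 83% utilization, which matches the figure in the post. `achieved_tflops` is there for anyone who times a matmul themselves and wants to convert seconds into the same unit.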