Strangely, Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data! [short]
It’s 2022. I check out this cool new project, CUTLASS, with very fast matmuls. I take a large matmul, 8192 x 8192 x 8192, and benchmark it in PyTorch, which calls CuBLAS.
python mm_bench.py
> CuBLAS: 258 Teraflops
Not bad, 83% flop utilization. Now let’s check out CUTLASS’s performance using their profiler.
./cutlass_profiler --operation=Gemm --m=8192 --n=8192 --k=8192
> CUTLASS: 288 Teraflops
!!! 10% higher perf? That’s incredible. CuBLAS is highly optimized for large compute-bound matmuls, and so...
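For anyone who wants to sanity-check the numbers quoted above, here is a minimal sketch of the arithmetic. It assumes the usual convention of counting 2 * M * N * K floating-point operations for a matmul, and it assumes a peak of 312 Teraflops, which is not stated in the post and was picked only because it is consistent with the 83% figure. The helper names are hypothetical and are not taken from mm_bench.py.

```python
# Back-of-the-envelope check of the throughput figures quoted in the post.
# ASSUMPTION: peak throughput of 312 Teraflops (not stated in the post).
ASSUMED_PEAK_TFLOPS = 312.0


def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Convert a measured runtime in seconds into Teraflops."""
    return matmul_flops(m, n, k) / seconds / 1e12


def utilization(achieved: float, peak: float = ASSUMED_PEAK_TFLOPS) -> float:
    """Fraction of the assumed peak throughput that was actually achieved."""
    return achieved / peak


if __name__ == "__main__":
    total = matmul_flops(8192, 8192, 8192)  # 2 * 8192**3 == 2**40
    print(f"FLOPs per matmul: {total:,}")
    for name, tflops in [("CuBLAS", 258.0), ("CUTLASS", 288.0)]:
        print(f"{name}: {utilization(tflops):.1%} of the assumed peak")
    print(f"CUTLASS vs CuBLAS: {288.0 / 258.0 - 1:.1%} higher")
```

To go from a real measurement to Teraflops, pass the average time of one matmul in seconds to achieved_tflops; how that time is measured is up to the benchmark script.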
Read more at thonking.ai