DeepReinforce AI's CUDA-L2 uses LLMs and reinforcement learning to optimize GPU matrix multiplication, outperforming NVIDIA's cuBLAS on A100 across 1,000 configurations.

deepreinforce-ai/CUDA-L2 (GitHub)

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

🥳 Introduction

CUDA-L2 is a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used torch.matmul to state-of-the-art NVIDIA closed-source libraries (cuBLAS, cuBLASLt-heuristic, cuBLASLt-AutoTuning). Pap...
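For readers unfamiliar with HGEMM, the following is a minimal pure-Python sketch of what such a kernel computes: C = A·B, where the inputs and the output are stored in IEEE 754 half precision (FP16). It is not CUDA-L2 code and says nothing about how the optimized kernels work. The helper names `to_half` and `hgemm_reference` are hypothetical, and accumulating in Python's double precision is an assumption made for simplicity; real GPU kernels typically accumulate in FP16 or FP32.

```python
import struct


def to_half(x):
    """Round a number to the nearest IEEE 754 half-precision (FP16) value."""
    # The "e" struct format is binary16; packing then unpacking performs the rounding.
    return struct.unpack("<e", struct.pack("<e", float(x)))[0]


def hgemm_reference(a, b):
    """Reference C = A @ B with FP16 inputs and FP16 output.

    Assumption: products are accumulated in double precision and rounded
    to FP16 once per output element.
    """
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions of A and B must match")
    n = len(b[0])

    # Quantize both inputs to FP16 first, as a half-precision kernel would see them.
    a16 = [[to_half(v) for v in row] for row in a]
    b16 = [[to_half(v) for v in row] for row in b]

    return [
        [to_half(sum(a16[i][p] * b16[p][j] for p in range(k))) for j in range(n)]
        for i in range(m)
    ]


if __name__ == "__main__":
    print(hgemm_reference([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    # prints [[19.0, 22.0], [43.0, 50.0]]
```

A slow reference like this is useful mainly as a correctness oracle: the output of an optimized kernel can be compared against it within an FP16 tolerance before any timing comparison is made.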