Compiling LLMs into a MegaKernel: A Path to Low-Latency Inference
TL;DR: We developed a compiler that automatically transforms LLM inference into a single megakernel — a fused GPU kernel that performs all necessary computation and communication in one launch. This end-to-end GPU fusion approach reduces LLM inference latency by 1.2-6.7x. Our compiler is easy to use — you can compile your LLM into a high-performance megakernel with just a few dozen lines of Python.

What’s the key idea? Traditional LLM systems often rely on sequences of GPU kernel launches and ext...
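The preview does not show the compiler's actual API, but a minimal sketch of what "a few dozen lines of Python" could look like is below. Every compiler-side name here (the `megakernel_compiler` module, `compile_model`, `generate`, and the tuning knobs) is a hypothetical placeholder, not the real interface; only the Hugging Face `transformers` calls are real. See the full post for the actual usage.

```python
# Hypothetical sketch only: illustrates compiling an LLM's forward pass into a
# single fused megakernel. The compiler-side names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import megakernel_compiler as mkc  # hypothetical module name

model_name = "meta-llama/Llama-3-8B"  # any causal LM checkpoint (placeholder)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Compile the entire decoding step (attention, MLPs, and any cross-GPU
# communication) into one fused kernel launched once per step.
engine = mkc.compile_model(
    model,
    max_batch_size=8,          # hypothetical tuning knobs
    max_seq_len=4096,
    tensor_parallel_size=1,
)

prompt = "Explain kernel launch overhead in one sentence."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Decoding runs inside the single persistent megakernel instead of issuing
# many separate kernel launches per generated token.
output_ids = engine.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```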
Read more at zhihaojia.medium.com