Researchers identify memory, interconnect bottlenecks in LLM inference; propose High Bandwidth Flash, 3D stacking, low-latency solutions for datacenter AI hardware.

Challenges and Research Directions for Large Language Model Inference Hardware

Abstract: Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI trends, the primary challenges are memory and interconnect rather than compute. To address these challenges, we highlight four architecture research opportunities: High Bandwidth Flash for 10X memory capacity with HBM-like bandwidth; Processing-Near-Memory and 3D memory-logic stackin...