MVDRAM: Enabling GeMV Execution in Unmodified DRAM for Low-Bit LLM Acceleration

Abstract: General matrix-vector multiplication (GeMV) remains a critical latency bottleneck in large language model (LLM) inference, even with quantized low-bit models. Processing-Using-DRAM (PUD), an analog in-DRAM computing technique, has the potential to repurpose on-device DRAM as a GeMV engine, offering additional high-throughput processing capabilities to widespread consumer devices without DRAM modifications. However, applying PUD to GeMV operations in the LLM ...
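For context, the sketch below illustrates the GeMV operation the abstract refers to, y = W·x with low-bit weights. The 4-bit symmetric per-row quantization scheme, shapes, and function names are assumptions for illustration only, not the paper's method; in a PUD design this product would be computed inside unmodified DRAM rather than on the CPU/GPU.

```python
import numpy as np

def quantize_4bit(W: np.ndarray):
    """Assumed symmetric per-row 4-bit quantization: W ~ scale[:, None] * Wq."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0   # int4 range: [-8, 7]
    Wq = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
    return Wq, scale

def gemv_low_bit(Wq: np.ndarray, scale: np.ndarray, x: np.ndarray) -> np.ndarray:
    """GeMV on dequantized low-bit weights; this is the per-token operation
    that dominates latency in LLM inference."""
    return (Wq.astype(np.float32) * scale) @ x

# Hypothetical sizes for one transformer weight matrix and activation vector.
W = np.random.randn(4096, 4096).astype(np.float32)
x = np.random.randn(4096).astype(np.float32)
Wq, scale = quantize_4bit(W)
y = gemv_low_bit(Wq, scale, x)   # approximates W @ x
```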

Read more at arxiv.org
