MVDRAM: Enabling GeMV Execution in Unmodified DRAM for Low-Bit LLM Acceleration
Abstract: General matrix-vector multiplication (GeMV) remains a critical latency bottleneck in large language model (LLM) inference, even with quantized low-bit models. Processing-Using-DRAM (PUD), an analog in-DRAM computing technique, has the potential to repurpose on-device DRAM as a GeMV engine, offering additional high-throughput processing capabilities to widespread consumer devices without DRAM modifications. However, applying PUD to GeMV operations in the LLM ...
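To make the operation under discussion concrete, here is a minimal sketch (not from the paper) of a GeMV with low-bit quantized weights, the computation the abstract identifies as the LLM inference bottleneck. The 4-bit per-row symmetric quantization scheme and the function names are illustrative assumptions, not MVDRAM's method.

```python
import numpy as np

def quantize_rows_int4(W):
    """Symmetric per-row 4-bit quantization: int codes in [-8, 7] plus fp32 scales.
    (Illustrative scheme, not the quantization used in the paper.)"""
    scales = np.abs(W).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero rows
    codes = np.clip(np.round(W / scales), -8, 7).astype(np.int8)
    return codes, scales

def gemv_int4(codes, scales, x):
    """Low-bit GeMV: y[i] = scales[i] * sum_j codes[i, j] * x[j]."""
    return (codes.astype(np.float32) @ x) * scales[:, 0]

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)  # weight matrix
x = rng.standard_normal(8).astype(np.float32)       # input activation vector

codes, scales = quantize_rows_int4(W)
y_approx = gemv_int4(codes, scales, x)  # quantized GeMV result
y_exact = W @ x                         # full-precision reference
```

A PUD-based engine would perform the multiply-accumulate over the quantized codes inside DRAM rather than on the CPU/GPU; the arithmetic being approximated is the same `y = Wx` shown here.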