The MegaTrain system trains 100B+ parameter AI models at full precision on a single GPU by storing parameters in CPU memory and streaming them to the GPU; it achieves 1.84x the throughput of DeepSpeed on H200 hardware.

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Abstract: We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To battle the CPU-GPU bandwidth bottleneck, we adopt ...
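
The per-layer streaming idea described in the abstract can be illustrated with a toy, CPU-only simulation. This is a minimal sketch under stated assumptions, not MegaTrain's implementation: the `Device` class, `train_step`, the linear-only layers, and the plain SGD update applied on the host are all hypothetical stand-ins, and the sketch does not model transfer overlap or any of the bandwidth optimizations the abstract goes on to mention.

```python
# Toy simulation of layer-by-layer parameter streaming (pure Python, no GPU).
# All names are hypothetical illustrations; this is not MegaTrain's API.

def matvec(W, x):
    """Return W x for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matvec_t(W, g):
    """Return W^T g, used to propagate the gradient to the layer input."""
    return [sum(W[i][j] * g[i] for i in range(len(W))) for j in range(len(W[0]))]

class Device:
    """Stands in for the GPU: holds at most one layer's weights at a time."""
    def __init__(self):
        self.resident = None
        self.peak_resident_params = 0

    def load(self, W):
        # The copy models a host-to-device transfer of one layer.
        self.resident = [row[:] for row in W]
        count = sum(len(row) for row in self.resident)
        self.peak_resident_params = max(self.peak_resident_params, count)
        return self.resident

    def evict(self):
        self.resident = None

def train_step(host_layers, x, target, lr=0.1):
    """One step over a stack of linear layers whose weights live on the host."""
    dev = Device()
    activations = [x]
    # Forward: stream each layer in, compute, then evict it.
    for W in host_layers:
        Wd = dev.load(W)
        activations.append(matvec(Wd, activations[-1]))
        dev.evict()
    out = activations[-1]
    loss = 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))
    grad = [o - t for o, t in zip(out, target)]
    # Backward: stream layers in reverse order and send gradients out.
    for idx in reversed(range(len(host_layers))):
        Wd = dev.load(host_layers[idx])
        layer_input = activations[idx]
        grad_W = [[g * xi for xi in layer_input] for g in grad]
        grad = matvec_t(Wd, grad)
        dev.evict()
        # Assumption: a plain SGD update performed on the host copy.
        W = host_layers[idx]
        for i in range(len(W)):
            for j in range(len(W[i])):
                W[i][j] -= lr * grad_W[i][j]
    return loss, dev.peak_resident_params

# Three 2x2 identity layers: 12 parameters in total on the host.
layers = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(3)]
loss0, peak = train_step(layers, [1.0, 2.0], [0.0, 0.0])
loss1, _ = train_step(layers, [1.0, 2.0], [0.0, 0.0])
print(loss0, loss1, peak)
```

In this toy run the simulated device never holds more than one layer (4 of the 12 parameters), while the full parameter set and the update step stay on the host. A real system would also have to decide where activations and optimizer states live and how to hide transfer latency, which this sketch deliberately leaves out.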