Communication Efficient LLM Pre-training with SparseLoCo
Abstract: Communication-efficient distributed training algorithms have received considerable interest recently due to their benefits for training Large Language Models (LLMs) in bandwidth-constrained settings, such as across data centers and over the internet. Despite reducing communication frequency, these methods still typically require communicating a full copy of the model's gradients, resulting in a communication bottleneck even for cross-datacenter links. Further...
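The bottleneck the abstract describes comes from periodically exchanging a dense, model-sized pseudo-gradient between workers. A minimal sketch of one common way to shrink that per-round payload is top-k sparsification, shown below; this is an illustrative assumption, not necessarily the exact mechanism SparseLoCo uses, and the function names are hypothetical.

```python
# Illustrative sketch (not the paper's algorithm): top-k sparsification of a
# locally accumulated pseudo-gradient so that only k values plus their indices
# travel over the slow link instead of the full dense tensor.
import math
import torch

def sparsify_topk(pseudo_grad: torch.Tensor, k: int):
    """Keep the k largest-magnitude entries; return (indices, values)."""
    flat = pseudo_grad.flatten()
    _, idx = torch.topk(flat.abs(), k)   # positions of the largest-magnitude entries
    return idx, flat[idx]                # payload to communicate: k indices + k values

def densify(idx: torch.Tensor, vals: torch.Tensor, shape, dtype):
    """Reconstruct a dense tensor from the sparse payload on the receiving side."""
    flat = torch.zeros(math.prod(shape), dtype=dtype)
    flat[idx] = vals
    return flat.view(shape)

# Example: communicate only 1% of a (4096 x 4096) weight's pseudo-gradient.
g = torch.randn(4096, 4096)
k = int(0.01 * g.numel())
idx, vals = sparsify_topk(g, k)
g_recovered = densify(idx, vals, g.shape, g.dtype)
```

In this sketch the communicated volume scales with k rather than with the parameter count, which is the general idea behind cutting the per-round cost that the abstract identifies.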