Google's Decoupled DiLoCo trains large AI models across distant data centers over low-bandwidth links and self-heals when hardware fails; a 12B-parameter model was trained 20x faster across four US regions using standard internet connectivity.

Decoupled DiLoCo: A new frontier for resilient, distributed AI training

Our new distributed architecture helps train LLMs across distant data centers, with lower bandwidth and greater hardware resiliency.

Training a frontier AI model traditionally depends on a large, tightly coupled system in which identical chips must stay in near-perfect synchronization. This approach is highly effective for today’s state-of-the-art models, but as we look toward future generations of scale, maintaining this level of synchronization across thousands of chips becomes a significant l...