Decoupled DiLoCo: A new frontier for resilient, distributed AI training
Our new distributed architecture helps train LLMs across distant data centers, with lower bandwidth and more hardware resiliency.

Training a frontier AI model traditionally depends on a large, tightly coupled system in which identical chips must stay in near-perfect synchronization. This approach is highly effective for today’s state-of-the-art models, but as we look toward future generations of scale, maintaining this level of synchronization across thousands of chips becomes a significant l...
Read more at deepmind.google