Training a frontier AI model has traditionally relied on a massive, tightly integrated system in which identical chips must remain in near-perfect synchronization. This method works extremely well for current leading models, but scaling it efficiently to future systems with thousands of chips poses major logistical hurdles.
In a new paper released today, we introduce a promising solution called Decoupled DiLoCo (Distributed Low-Communication). By splitting large training runs into separate "islands" of compute that operate independently and exchange data asynchronously, this design contains local failures so the rest of the system can continue learning without interruption. The outcome is a far more resilient and adaptable approach to training cutting-edge models across globally distributed data centers. Importantly, Decoupled DiLoCo also avoids the communication delays that previously made distributed techniques like data-parallel training impractical at global scale.
As frontier models continue to increase in size and complexity, we are investigating a range of methods for training across larger amounts of compute, different geographic locations, and heterogeneous hardware.
Figure 1: By splitting training runs into separate "islands" of compute (learner units), large-scale training can continue with minimal interruption even under the same rate of hardware failures, since the impact of those failures remains isolated.
This approach advances the development of highly fault-tolerant asynchronous training at scale. Decoupled DiLoCo combines two prior breakthroughs: Pathways, which introduced a distributed AI system based on asynchronous data flow, and DiLoCo, which sharply reduced the bandwidth needed between data centers, enabling practical training of large language models across remote locations. By merging these concepts, Decoupled DiLoCo makes it possible to train AI models at scale with greater flexibility.
Decoupled DiLoCo is built on Pathways and supports asynchronous training across independent islands of compute (called learner units). As a result, a chip failure in one learner unit does not halt the progress of the others. The infrastructure is also self-healing. During our tests, we applied a technique known as "chaos engineering" to simulate hardware failures while models were training.
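To make the failure-isolation idea concrete, here is a toy Python sketch. It is not the implementation from the paper. It assumes a DiLoCo-style loop in which each learner unit runs several local optimization steps and then reports only the change in its parameters, and it injects random failures loosely in the spirit of chaos engineering. Everything in it is a hypothetical choice made for illustration: the names `local_steps` and `train`, the one-dimensional quadratic objective, the plain averaging in the outer step, and the lock-step rounds standing in for truly asynchronous exchange.

```python
import random


def local_steps(theta, target, steps, lr):
    """Run `steps` of gradient descent on f(x) = 0.5 * (x - target)**2."""
    for _ in range(steps):
        theta -= lr * (theta - target)
    return theta


def train(num_learners=4, rounds=30, inner_steps=10, inner_lr=0.1,
          outer_lr=1.0, failure_prob=0.25, seed=0):
    """Toy low-communication training loop with simulated learner failures."""
    rng = random.Random(seed)
    # Hypothetical data shards: each learner unit pulls toward its own target.
    targets = [1.0 + 0.5 * i for i in range(num_learners)]
    theta = 10.0  # shared parameter, deliberately initialized far away

    for _ in range(rounds):
        deltas = []
        for target in targets:
            # Injected fault: this learner unit contributes nothing this round.
            if rng.random() < failure_prob:
                continue
            local = local_steps(theta, target, inner_steps, inner_lr)
            # Only the parameter change is communicated, once per round.
            deltas.append(theta - local)
        # Surviving learners still move the shared parameter; if every
        # learner failed, the round is skipped and nothing is lost.
        if deltas:
            theta -= outer_lr * sum(deltas) / len(deltas)
    return theta


if __name__ == "__main__":
    print("no failures:  ", train(failure_prob=0.0))
    print("25% failures: ", train(failure_prob=0.25))
```

In this sketch a failed learner only removes its own contribution for that round, so the shared parameter keeps moving toward the region spanned by the learners' targets. Communication happens once per round rather than once per gradient step, which is the bandwidth saving attributed to DiLoCo above.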
Google DeepMind News