Decoupled DiLoCo: A new frontier for resilient, distributed AI training
April 23, 2026 | Research
Arthur Douillard and the DiLoCo team
Our new distributed architecture helps to train LLMs across distant data centers, with lower bandwidth requirements and greater hardware resiliency.
Training a frontier AI model traditionally depends on a large, tightly coupled system in which identical chips must stay in near-perfect synchronization. This approach is highly effective for today’s state-of-the-art models, but as we look toward future generations of scale, maintaining this level of synchronization across thousands of chips becomes a significant logistical challenge.
Today, in a new paper, we share an approach to this problem called Decoupled DiLoCo (Distributed Low-Communication). By dividing large training runs across decoupled “islands” of compute, with asynchronous data flowing between them, the architecture isolates local disruptions so that the rest of the system can keep learning efficiently.
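To make the island structure concrete, here is a minimal, hypothetical sketch of a DiLoCo-style inner/outer loop in NumPy. The function names and hyperparameters are ours, not the paper's: each island runs H inner optimization steps locally, only the resulting parameter delta crosses the network, and a Nesterov-style outer optimizer applies the averaged delta.

```python
import numpy as np

def inner_steps(params, grad_fn, lr, H):
    # Each island runs H local steps with no cross-island traffic.
    # Plain SGD stands in for the inner optimizer here (DiLoCo-style
    # setups typically use AdamW).
    p = params.copy()
    for _ in range(H):
        p = p - lr * grad_fn(p)
    return p

def diloco_round(global_params, islands_grad_fns, inner_lr, outer_lr,
                 momentum, velocity, H=50):
    # One outer round: every island starts from the shared parameters,
    # trains locally for H steps, and communicates back only its
    # parameter delta. Because this exchange happens once per H steps
    # rather than every step, required bandwidth drops by roughly a
    # factor of H.
    deltas = []
    for grad_fn in islands_grad_fns:
        local = inner_steps(global_params, grad_fn, inner_lr, H)
        deltas.append(global_params - local)  # the "outer gradient"
    outer_grad = np.mean(deltas, axis=0)
    # Nesterov-style momentum update applied to the averaged delta.
    velocity = momentum * velocity + outer_grad
    new_params = global_params - outer_lr * (outer_grad + momentum * velocity)
    return new_params, velocity
```

On a toy quadratic loss this converges with only one exchange every H steps; the real system applies the same communication pattern to transformer parameters across data centers.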
The result is a more resilient and flexible way to train advanced models across globally distributed data centers. And crucially, Decoupled DiLoCo does not suffer from the communication delays that make conventional distributed methods, such as Data-Parallel training, impractical at global scale.
As frontier models continue to grow in scale and complexity, we’re exploring diverse approaches to train models across more compute, locations and varied hardware.
Figure 1: Decoupling training runs into separate “islands” of compute (learner units) allows largely uninterrupted training despite the same level of hardware failures, because the effects of those failures are isolated.
Developing more fault-tolerant asynchronous training at scale
Decoupled DiLoCo builds on two earlier advances: Pathways, which introduced a distributed AI system based on asynchronous data flow, and DiLoCo, which dramatically reduced the bandwidth required between distributed data centers, making it practical to train large language models across distant locations.
Decoupled DiLoCo brings those ideas together to train AI models more flexibly at scale. Built on top of Pathways, it enables asynchronous training across separate islands of compute (known as learner units) so that a chip failure in one area doesn’t interrupt the progress of the others.
This infrastructure is also self-healing. In testing, we used a method called “chaos engineering” to introduce artificial hardware failures during training runs. Decoupled DiLoCo continued the training process after the loss of entire learner units, and then seamlessly reintegrated them when they came back online.
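The failure-isolation behaviour can be illustrated with a toy simulation (entirely ours, not the production chaos-engineering setup): learner units drop out at random each round, the outer update averages over whichever units survive, and a recovering unit simply resumes from the latest global state rather than forcing a restart.

```python
import random

def run_with_failures(rounds, n_units, p_fail, seed=0):
    # Toy model: each round, every learner unit independently fails
    # with probability p_fail. Failed units contribute no delta; the
    # outer update averages over survivors. A unit that comes back
    # simply re-reads the latest global value, so no global restart
    # is needed: the "self-healing" behaviour above, in miniature.
    rng = random.Random(seed)
    global_x = 10.0
    useful_steps = total_steps = 0
    for _ in range(rounds):
        deltas = []
        for _ in range(n_units):
            total_steps += 1
            if rng.random() < p_fail:
                continue  # this unit is down for the round
            useful_steps += 1
            local = global_x * 0.9  # stand-in for H inner steps
            deltas.append(global_x - local)
        if deltas:  # outer step with whoever survived
            global_x -= sum(deltas) / len(deltas)
    goodput = useful_steps / total_steps
    return global_x, goodput
```

Even with 30% of units failing every round, the toy objective still converges and goodput stays near the fraction of surviving units; in a tightly coupled system, any single failure would stall all of them.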
Testing Decoupled DiLoCo with Gemma 4 models demonstrated that, when hardware fails, the system maintains greater availability of learning clusters than more traditional training methods, while ultimately delivering the same benchmarked level of machine learning (ML) performance.
[Figure: Three bar charts comparing Data-Parallel training with Decoupled DiLoCo. Required Bandwidth: Decoupled DiLoCo reduces cross-datacenter bandwidth needs from 198 Gbps to 0.84 Gbps across 8 datacenters (log scale). Goodput: in a simulated environment of 1.2 million chips with high failure rates, Decoupled DiLoCo maintains 88% goodput versus 27% for standard Data-Parallel training. ML Benchmarks: Decoupled DiLoCo achieves 64.1% average accuracy, nearly matching the 64.4% baseline.]
Figure 2: **Left**: The Decoupled DiLoCo approach requires orders of magnitude less bandwidth than conventional training methods, making it very efficient. **Middle**: With increasing levels of hardware failure, Decoupled DiLoCo continues to deliver a high level of “goodput”, or useful training, while that of other approaches nosedives. (The first two charts are based on simulated training runs). **Right**: In real-world experiments, the benchmarked ML performance of Gemma 4 models trained using Decoupled DiLoCo equalled the performance attained with conventional training approaches.
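The bandwidth gap in the left chart follows directly from synchronization frequency. A back-of-the-envelope model (our simplification, with illustrative numbers rather than the paper's exact configuration): if one copy of the exchanged state must cross the wide-area network every `sync_every` training steps, the required sustained bandwidth scales as 1/sync_every.

```python
def required_gbps(payload_bytes, step_seconds, sync_every):
    # Sustained WAN bandwidth if `payload_bytes` of gradients or
    # parameter deltas must be transferred once every `sync_every`
    # training steps. Data-Parallel syncs every step (sync_every=1);
    # DiLoCo syncs once per round of many inner steps.
    return payload_bytes * 8 / (step_seconds * sync_every) / 1e9
```

Under this model, syncing every 200 steps instead of every step cuts bandwidth 200-fold, which is the same order of magnitude as the 198 Gbps to 0.84 Gbps reduction shown above.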
Decoupled DiLoCo is not only more resilient to failures; it is also practical for production-level, fully distributed pre-training. We successfully trained a 12-billion-parameter model across four separate U.S. regions using only 2-5 Gbps of wide-area networking (a level achievable with existing internet connectivity between datacenter facilities, rather than requiring new custom network infrastructure). Notably, the system achieved this training result more than 20 times faster than conventional synchronization methods. This is because our system overlaps the required communication with longer periods of computation, avoiding the "blocking" bottlenecks in which one part of the system must wait for another.
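The "blocking" point can be made precise with a deterministic timing model (a sketch under our own assumptions, not the paper's scheduler): with overlap, round r's delta is transferred while round r+1 computes, so the transfer adds to the critical path only when it outlasts the computation.

```python
def makespan(rounds, compute, comm, overlap):
    # Timing model for the two schedules. Without overlap, training
    # blocks on the WAN transfer every round. With overlap, each
    # round's transfer runs in the background during the next round's
    # inner steps, so it costs wall-clock time only if comm > compute.
    if not overlap:
        return rounds * (compute + comm)
    compute_end = send_end = 0.0
    for _ in range(rounds):
        compute_end += compute                        # inner steps
        send_end = max(compute_end, send_end) + comm  # background transfer
    return send_end
```

For example, with 1 s of computation and 0.5 s of communication per round, four blocking rounds take 6 s while four overlapped rounds take 4.5 s; as the ratio of computation to communication grows, the transfer cost is hidden almost entirely.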
Driving the evolution of AI training infrastructure
At Google, we take a full-stack approach to AI training, spanning hardware, software infrastructure and research. Increasingly, gains are coming from rethinking how these layers fit together.
Decoupled DiLoCo is one example. Because it can run training jobs over ordinary internet-scale bandwidth, it can tap unused compute wherever it sits, turning stranded resources into useful capacity.
Beyond efficiency and resilience, this training paradigm also unlocks the ability to mix different hardware generations, such as TPU v6e and TPU v5p, in a single training run. This approach not only extends the useful life of existing hardware, but also increases the total compute available for model training. In our experiments, chips from different generations running at different speeds still matched the ML performance of single-chip-type training runs, ensuring that even older hardware can meaningfully accelerate AI training.
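One simple way to accommodate mixed generations, sketched below under our own assumptions (the production system may balance load differently): islands synchronize on wall-clock time rather than step count, so a faster chip generation contributes more inner steps per outer round instead of forcing slower islands to stall.

```python
def steps_per_round(chip_speeds, round_seconds, step_cost):
    # Hypothetical illustration: `chip_speeds` maps an island's
    # hardware generation to its relative throughput, and `step_cost`
    # is the seconds per inner step on the 1.0x-speed baseline.
    # Each island runs as many inner steps as fit in the wall-clock
    # round, so newer hardware simply contributes more local progress
    # per outer step.
    return {name: int(speed * round_seconds / step_cost)
            for name, speed in chip_speeds.items()}
```

With a 60-second outer round, an island of chips running at twice the baseline speed performs twice as many inner steps before the next delta exchange, and neither generation ever waits on the other mid-round.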
What’s more, because new generations of hardware don’t arrive everywhere all at once, being able to train across generations can alleviate recurring logistical and capacity bottlenecks.
As we push the frontiers of AI infrastructure today, we’re continuing to explore approaches to resilient systems needed to unlock the next generation of AI.
Acknowledgements
This work was done by a team of members across Google DeepMind and Google Research.
The leads and core contributors behind Decoupled DiLoCo are Arthur Douillard, Keith Rush, Yani Donchev, Zachary Charles, Ayush Dubey, Blake Woodworth, Ionel Gog, Josef Dean, Nova Fallen, and Zachary Garrett. Operational support was provided by Nate Keating and Jenny Bishop.
We are also grateful for the additional support and advising from Jeff Dean, Marc’Aurelio Ranzato, Raia Hadsell, Arthur Szlam, Edouard Yvinec, Henry Prior, Paul Barham, Michael Isard, Daniel Ramage, Brendan McMahan, Chase Hensel, and Zoltan Egyed.