Google DeepMind Blog


April 23, 2026 Research

Decoupled DiLoCo: A new frontier for resilient, distributed AI training

Arthur Douillard and the DiLoCo team


Our new distributed architecture helps train LLMs across distant data centers, with lower bandwidth requirements and greater hardware resilience.

Training a frontier AI model traditionally depends on a large, tightly coupled system in which identical chips must stay in near-perfect synchronization. This approach is highly effective for today’s state-of-the-art models, but as we look toward future model generations at even greater scale, maintaining this level of synchronization across thousands of chips becomes a significant logistical challenge.

Today, in a new paper, we are excited to share an approach to this problem called Decoupled DiLoCo (Distributed Low-Communication). By dividing large training runs across decoupled “islands” of compute, with asynchronous data flowing between them, the architecture isolates local disruptions so that the rest of the system can keep learning efficiently.

The result is a more resilient and flexible way to train advanced models across globally distributed data centers. And crucially, Decoupled DiLoCo does not suffer from the communication delays that made previous distributed methods, such as Data-Parallel training, impractical at global scale.
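The division into islands follows the DiLoCo recipe: each island runs many cheap local optimizer steps, and islands exchange only a compact “pseudo-gradient” once per round. The toy sketch below illustrates that two-level loop on a scalar fitting problem; plain SGD and simple momentum stand in for the AdamW inner and Nesterov-momentum outer optimizers of the published method, and all hyperparameters are illustrative:

```python
import numpy as np

def diloco_train(islands_data, rounds=20, inner_steps=20,
                 inner_lr=0.1, outer_lr=0.7, outer_momentum=0.5):
    """Toy two-level DiLoCo loop: fit a shared scalar to data spread
    across islands. Each island takes many cheap local SGD steps; the
    only cross-island communication is one "pseudo-gradient" (shared
    parameter minus locally trained parameter) per round."""
    theta = 0.0       # shared (outer) parameter
    velocity = 0.0    # outer momentum buffer
    for _ in range(rounds):
        pseudo_grads = []
        for data in islands_data:
            local = theta                        # start from the shared snapshot
            for _ in range(inner_steps):
                # gradient of the local loss 0.5 * (local - mean(data))^2
                local -= inner_lr * (local - np.mean(data))
            pseudo_grads.append(theta - local)   # one scalar sent per round
        # Outer step: the only synchronization point between islands.
        velocity = outer_momentum * velocity + float(np.mean(pseudo_grads))
        theta -= outer_lr * velocity
    return theta
```

Because each island communicates once per round rather than once per step, the cross-island traffic shrinks by roughly the number of inner steps per round.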

As frontier models continue to grow in scale and complexity, we’re exploring diverse approaches to train models across more compute, locations and varied hardware.

Figure 1: Decoupling training runs into separate “islands” of compute (learner units) allows training to continue largely uninterrupted under the same rate of hardware failures, because the effects of those failures are isolated to the affected island.

Developing more fault-tolerant asynchronous training at scale

Decoupled DiLoCo builds on two earlier advances: Pathways, which introduced a distributed AI system based on asynchronous data flow, and DiLoCo, which dramatically reduced the bandwidth required between distributed data centers, making it practical to train large language models across distant locations.

Decoupled DiLoCo brings those ideas together to train AI models more flexibly at scale. Built on top of Pathways, it enables asynchronous training across separate islands of compute (known as learner units) so that a chip failure in one area doesn’t interrupt the progress of the others.

This infrastructure is also self-healing. In testing, we used a method called “chaos engineering” to introduce artificial hardware failures during training runs. Decoupled DiLoCo continued the training process after the loss of entire learner units, and then seamlessly reintegrated them when they came back online.
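The failure-isolation and reintegration behaviour can be sketched in a few lines: the outer step averages contributions from whichever learner units are healthy, and a recovered unit simply restarts from the current shared parameters. The `ToyLearner` interface below is hypothetical, invented for illustration:

```python
class ToyLearner:
    """Toy learner unit (hypothetical interface): a compute island that
    pulls a shared scalar toward the mean of its local data. `alive`
    simulates hardware health; `stale` marks a recovered unit that must
    resync before contributing again."""
    def __init__(self, mean):
        self.mean, self.alive, self.stale = mean, True, False

    def pseudo_gradient(self, theta, inner_steps=20, inner_lr=0.1):
        local = theta
        for _ in range(inner_steps):                 # cheap local training
            local -= inner_lr * (local - self.mean)
        return theta - local                         # compact cross-island message


def outer_step(theta, learners, outer_lr=0.7):
    """One failure-tolerant outer step: average pseudo-gradients from the
    units that are up, skip the ones that are down, and reintegrate
    recovered units from the current shared parameters."""
    contributions = []
    for unit in learners:
        if not unit.alive:
            continue                  # isolated failure: never blocks the rest
        if unit.stale:
            unit.stale = False        # "self-healing": rejoin from theta
        contributions.append(unit.pseudo_gradient(theta))
    if contributions:                 # survivors keep making progress
        theta -= outer_lr * sum(contributions) / len(contributions)
    return theta


# Chaos-engineering-style run: kill one island mid-training, revive it later.
learners = [ToyLearner(1.0), ToyLearner(3.0)]
theta = 0.0
for r in range(15):
    if r == 5:
        learners[1].alive = False                           # simulated failure
    if r == 9:
        learners[1].alive, learners[1].stale = True, True   # unit comes back
    theta = outer_step(theta, learners)
```

Training keeps progressing while one island is down, and the revived island rejoins without disrupting the others.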

Testing Decoupled DiLoCo with Gemma 4 models demonstrated that, when hardware fails, the system maintains greater availability of learning clusters than more traditional training methods — while ultimately delivering the same benchmarked level of machine learning (ML) performance.

[Figure 2 charts: Required Bandwidth: Decoupled DiLoCo cuts cross-datacenter bandwidth from 198 Gbps to 0.84 Gbps across 8 datacenters (logarithmic scale). Goodput: in a simulation of 1.2 million chips with high failure rates, DiLoCo sustains 88% goodput versus 27% for standard Data-Parallel training. ML Benchmarks: DiLoCo reaches 64.1% average accuracy, nearly matching the 64.4% Data-Parallel baseline.]

Figure 2: **Left**: The Decoupled DiLoCo approach requires orders of magnitude less bandwidth than conventional training methods, making it very efficient. **Middle**: With increasing levels of hardware failure, Decoupled DiLoCo continues to deliver a high level of “goodput”, or useful training, while that of other approaches nosedives. (The first two charts are based on simulated training runs). **Right**: In real-world experiments, the benchmarked ML performance of Gemma 4 models trained using Decoupled DiLoCo equalled the performance attained with conventional training approaches.
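The orders-of-magnitude bandwidth gap follows directly from synchronization frequency: data-parallel training exchanges gradient-sized state every optimizer step, whereas DiLoCo does so only once per round of many inner steps. A back-of-envelope calculation makes the scaling visible; the model size, precision, step time, and sync interval below are all illustrative assumptions, not the measured values behind the figure:

```python
def required_wan_gbps(params, bytes_per_value, step_seconds, sync_every_steps):
    """Back-of-envelope WAN bandwidth: one full exchange of model-sized
    state every `sync_every_steps` optimizer steps. Illustrative only."""
    bits = params * bytes_per_value * 8
    return bits / (step_seconds * sync_every_steps) / 1e9

# Hypothetical 12B-parameter model, bf16 values (2 bytes), 1 s per step:
data_parallel = required_wan_gbps(12e9, 2, 1.0, sync_every_steps=1)    # every step
diloco        = required_wan_gbps(12e9, 2, 1.0, sync_every_steps=250)  # per round
```

Under these assumed numbers, syncing every 250 steps instead of every step cuts the required bandwidth by the same factor of 250, which is the kind of reduction the left chart shows.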

Decoupled DiLoCo is not only more resilient to failures, but is also practical for executing production-level, fully distributed pre-training. We successfully trained a 12 billion parameter model across four separate U.S. regions using 2-5 Gbps of wide-area networking, a level achievable with existing internet connectivity between datacenter facilities rather than requiring new custom network infrastructure. Notably, the system achieved this training result more than 20 times faster than conventional synchronization methods. This is because our system overlaps the required communication with longer periods of computation, avoiding the "blocking" bottlenecks where one part of the system must wait for another.
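That non-blocking behaviour boils down to a classic overlap pattern: ship round t's update in the background while round t+1 computes. The threaded sketch below is a minimal illustration of the idea, not the actual Pathways implementation; `compute_round` and `send_update` are caller-supplied stand-ins for local training and cross-island transfer:

```python
import threading

def train_with_overlap(rounds, compute_round, send_update):
    """Overlap sketch: the update produced by round t is transmitted in a
    background thread while round t+1 computes, so slow wide-area
    communication hides behind long stretches of computation instead of
    blocking them."""
    sender = None
    for t in range(rounds):
        update = compute_round(t)          # long local computation
        if sender is not None:
            sender.join()                  # previous send already overlapped
        sender = threading.Thread(target=send_update, args=(update,))
        sender.start()                     # ship this round's update async
    if sender is not None:
        sender.join()                      # drain the final in-flight update

# Toy usage: "sending" just records the update.
sent = []
train_with_overlap(3, lambda t: t * 10, sent.append)
```

As long as a round of computation takes longer than a transfer, the communication cost disappears from the critical path entirely.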

Driving the evolution of AI training infrastructure

At Google, we take a full-stack approach to AI training, spanning hardware, software infrastructure and research. Increasingly, gains are coming from rethinking how these layers fit together.

Decoupled DiLoCo is one example. By enabling training jobs to run over internet-scale bandwidth, it can tap any unused compute wherever it sits, turning stranded resources into useful capacity.

Beyond efficiency and resilience, this training paradigm also unlocks the ability to mix different hardware generations, such as TPU v6e and TPU v5p, in a single training run. This approach not only extends the useful life of existing hardware, but also increases the total compute available for model training. In our experiments, chips from different generations running at different speeds still matched the ML performance of single-chip-type training runs, ensuring that even older hardware can meaningfully accelerate AI training.
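The post does not detail how islands running at different speeds are scheduled, but one simple, hypothetical way to mix generations is to fix each round's length in wall-clock time rather than in steps, so each generation contributes as many inner steps as it can complete before the shared synchronization point; the throughput figures below are invented for illustration:

```python
def inner_steps_per_round(steps_per_second, round_seconds):
    """Hypothetical scheduling rule: every island trains for the same
    wall-clock window, so a faster chip generation simply completes more
    inner steps before the shared synchronization point."""
    return int(steps_per_second * round_seconds)

# Invented throughput figures for two TPU generations sharing one run:
v5p_steps = inner_steps_per_round(steps_per_second=1.0, round_seconds=300)
v6e_steps = inner_steps_per_round(steps_per_second=1.6, round_seconds=300)
```

Under this assumption, every chip contributes useful work each round regardless of its speed, which is consistent with the observation that mixed-generation runs matched single-chip-type performance.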

What’s more, because new generations of hardware don’t arrive everywhere all at once, being able to train across generations can alleviate recurring logistical and capacity bottlenecks.

As we push the frontiers of AI infrastructure today, we’re continuing to explore approaches to resilient systems needed to unlock the next generation of AI.

Read our technical report

Acknowledgements

This work was done by a team of members across Google DeepMind and Google Research.

The leads and core contributors behind Decoupled DiLoCo are Arthur Douillard, Keith Rush, Yani Donchev, Zachary Charles, Ayush Dubey, Blake Woodworth, Ionel Gog, Josef Dean, Nova Fallen, and Zachary Garrett. Operational support was provided by Nate Keating and Jenny Bishop.

We are also grateful for the additional support and advising from Jeff Dean, Marc’Aurelio Ranzato, Raia Hadsell, Arthur Szlam, Edouard Yvinec, Henry Prior, Paul Barham, Michael Isard, Daniel Ramage, Brendan McMahan, Chase Hensel, and Zoltan Egyed.
