Google Cloud Blog

How Google's global and data center networks are evolving for the AI era

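TL;DR · AI summary

Google is building out its global and data center networks to support the compute demands of the AI era, including distributed compute resources and efficient data transfer.

Key points

  • Google overcomes power limits by locating data centers near sustainable energy sources and using the network to distribute AI workloads.
  • Google has created an end-to-end, vertically integrated AI technology stack spanning chips, systems, platforms, and application ecosystems.
  • Google's network infrastructure has been redesigned to meet the high-bandwidth, large-scale, and high-performance needs of AI workloads.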

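Outline

  1. Google has built its global network over the past 25 years and now faces the new challenges of the AI era.

  2. Google locates data centers near sustainable energy sources and adds clean energy supply where power capacity falls short.

  3. Google has created an end-to-end, vertically integrated AI technology stack spanning chips, systems, platforms, and application ecosystems.

  4. Google has redesigned its network infrastructure to meet the high-bandwidth, large-scale, and high-performance needs of AI workloads.

  5. With these improvements, Google ensures its network can support the compute demands of the AI era.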

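Mind map

  • Google data centers and global network
    • Data center siting strategy
      • Close to sustainable energy sources
      • Adding clean energy supply
    • AI technology stack
      • End-to-end vertical integration
      • Chips, systems, platforms, application ecosystems
    • Network infrastructure upgrades
      • High bandwidth
      • Large scale
      • High performance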

#Google Cloud #AI #Data centers #Network infrastructure

Over the last 25 years of building Google’s global network, we’ve navigated major architectural eras: the Internet, streaming, and the cloud. Today, we are squarely in the midst of a fourth: the AI era. The applications of the AI era are fundamentally different from the consumer and enterprise applications of previous eras and impose novel and demanding requirements, on compute resources, of course, but also on the network.

Consider the fundamental physical challenge: it is far more difficult to move electrons (electrical power) than to move photons (data over fiber). Because the demand for AI compute frequently outpaces the space and power capacities of individual facilities, we strategically locate data centers near sustainable energy sources, or in locations with pathways to add clean energy sources to the local grid. Then, by using the network to distribute AI workloads across campuses, we create a massive-scale, pooled hypercomputing resource that overcomes the power limitations of any single site.

Image 1: https://storage.googleapis.com/gweb-cloudblog-publish/images/1_R90253L.max-1900x1900.jpg

To deliver this, we created an end-to-end, vertically integrated AI technology stack that comprises everything from chips to systems, to platforms and application and agentic ecosystems. This stack includes a portfolio of pre-built agents and applications; our Gemini Enterprise Agent Platform for you to build, scale, govern, and optimize your AI-enabled applications; world-class AI models; as well as our unified data platform. All this is anchored by our AI Hypercomputer, a unified infrastructure that combines purpose-built hardware and open software, and that comes with flexible consumption options. Our network, forged through decades of innovation, is the essential fabric of the AI Hypercomputer.

Image 2: https://storage.googleapis.com/gweb-cloudblog-publish/images/2_bZdv9ks.max-1100x1100.jpg

The network supporting this stack must meet the stringent bandwidth, scale, and performance needs of AI workloads. This applies not only within the campus, where the network must scale up and out, but also across the wide area network (WAN) along with high-bandwidth interconnects, to bring AI training data from its source to AI compute resources.

To address these challenges, we’ve reimagined three key pillars of our network infrastructure: the fabric inside the AI Hypercomputer, the fabric across the AI Hypercomputer, and our global network. Let’s take a closer look at each of these.

**1. The fabric inside AI Hypercomputer**

The massive scale of today’s AI models, fueled by the explosive growth of foundational AI model parameters, makes AI training very compute- and network-intensive.

Image 3: https://storage.googleapis.com/gweb-cloudblog-publish/images/3_eO2Dxet.max-1900x1900.jpg

This necessitates an exponential increase in required network bandwidth, with strict bounds on delay (e.g., tail latency) to accommodate AI workloads’ peculiar traffic patterns, which are characterized by sensitivity to performance variation and synchronized bursts, i.e., intense, coordinated, millisecond-level traffic spikes. Furthermore, since large-scale training jobs are uniquely vulnerable to failures and performance stragglers, maintaining high reliability and predictable performance is absolutely essential.

To address the scale, low latency, and high predictability that modern AI workloads require — as well as protection from extreme bursts — we’ve adopted a "campus as a computer" philosophy, decoupling our network into three distinct domains:

  • a scale-up domain for intra-pod connectivity
  • a dedicated east-west scale-out accelerator fabric
  • the Jupiter frontend network for north-south compute and storage access

This decoupled architecture provides three strategic advantages: it allows domains to evolve independently for faster innovation; provides a non-blocking scale-out network with massive training bandwidth; and helps ensure the network can be co-designed in lockstep with new ML accelerators, for superior hardware support.

Recently, we unveiled **Virgo Network**, our scale-out data center fabric specifically engineered for modern AI. Virgo utilizes high-radix switches and a flat, two-layer non-blocking topology to provide massive bisection bandwidth, while minimizing latency by reducing network tiers. Its multi-planar design, featuring independent control domains for each plane, provides hardware-level resilience and fault isolation. Furthermore, Virgo can expand across multiple data centers, removing physical building limitations and enabling flexible AI compute scaling.

Image 4: https://storage.googleapis.com/gweb-cloudblog-publish/images/virgo_network_architecture_figure.max-2200x2200.jpg
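To make the "flat, two-layer non-blocking topology" concrete, here is a minimal sizing sketch. It assumes a textbook folded-Clos (leaf-spine) layout built from identical switches and assumes that every endpoint has one port into each plane; neither assumption is confirmed as Virgo's actual design. The function name and the example radix and port speed are hypothetical.

```python
def two_tier_fabric(radix: int, port_gbps: float, planes: int = 1) -> dict:
    """Size a non-blocking two-tier (leaf-spine) fabric of identical switches.

    Assumption (textbook folded Clos, not Virgo's published design): each leaf
    uses half its ports for endpoints and half for uplinks, and each spine
    uses every port to reach a distinct leaf.
    """
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be a positive even number")
    if planes < 1:
        raise ValueError("planes must be at least 1")

    down_ports = radix // 2          # endpoint-facing ports per leaf
    leaves = radix                   # each spine port reaches one leaf
    spines = radix // 2              # each leaf uplink reaches one spine
    endpoints = leaves * down_ports  # radix**2 / 2 endpoints

    # Bisection: half the endpoints send to the other half at line rate.
    # Assumption: each endpoint has one port into every plane.
    bisection_gbps = (endpoints / 2) * port_gbps * planes

    return {
        "leaves_per_plane": leaves,
        "spines_per_plane": spines,
        "endpoints": endpoints,
        "bisection_tbps": bisection_gbps / 1000,
        "max_switch_hops": 3,        # leaf -> spine -> leaf
    }


# Hypothetical inputs: 64-port switches, 400 Gbps ports, 4 planes.
example = two_tier_fabric(radix=64, port_gbps=400.0, planes=4)
```

Because endpoint count grows with the square of the switch radix, higher-radix switches let two tiers reach a scale that would otherwise need a third tier, and a third tier would add two more switch hops to the worst-case path.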

The effectiveness of our network and accelerator co-design is illustrated by the recently debuted eighth-generation TPUs. Within this architecture, Virgo Network can link 134,000 TPU 8t chips with up to 47 petabits/sec of non-blocking bisection bandwidth in a single fabric. Compared to the previous-generation network for TPUs, Virgo Network delivers up to 4x the bandwidth per TPU 8t accelerator and 40% lower unloaded fabric latency. In this setup, Virgo Network manages the raw accelerator traffic, while Jupiter provides reliable and rapid access to the global WAN and storage. When integrated with Pathways and JAX, this AI Hypercomputer networking engine facilitates near-linear scaling for up to a million TPU 8t chips in a single logical cluster.
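As a rough sanity check on the fabric figures above, the short calculation below divides the quoted bisection bandwidth by the quoted chip count. The article does not state how bisection bandwidth is counted or what the per-chip port speed is, so both results are back-of-the-envelope averages under stated assumptions, not published specifications.

```python
CHIPS = 134_000          # TPU 8t chips in one fabric (from the article)
BISECTION_PBPS = 47      # petabits/sec of bisection bandwidth (from the article)

bisection_gbps = BISECTION_PBPS * 1e6  # 1 Pb/s = 1,000,000 Gb/s

# Average bisection bandwidth per chip across the whole fabric.
per_chip_gbps = bisection_gbps / CHIPS

# Assumption: if bisection is counted as half the chips sending to the other
# half, this is the implied line rate per sending chip.
per_sender_gbps = bisection_gbps / (CHIPS / 2)

print(f"{per_chip_gbps:.1f} Gb/s per chip")      # 350.7 Gb/s per chip
print(f"{per_sender_gbps:.1f} Gb/s per sender")  # 701.5 Gb/s per sender
```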

Autonomous reliability: protecting workload goodput

Building a resilient megascale fabric represents only part of the challenge. In a cluster of hundreds of thousands of chips, hardware failures are a statistical certainty. A single stalled instance can stop an entire synchronous training job, wasting valuable compute cycles. As such, efficient fault localization is critical.

We engineered Virgo Network with autonomous reliability capabilities to maximize goodput, that is, workload efficiency at scale. Expanding on our existing straggler detection, Virgo Network now also features automated hang detection. The moment a fail-stop event occurs, our specialized agents immediately localize the fault, isolate the faulty instance, and enable you to restore the training job from a checkpoint, getting your training timeline back on track with minimal manual intervention. Learn more by watching this demo:

Image 5: https://storage.googleapis.com/gweb-cloudblog-publish/images/maxresdefault_aGs9w20.max-1300x1300.jpg

Video 5
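The difference between a straggler (slow but alive) and a hang (no progress at all) can be sketched in a few lines. This is an illustrative toy, not Virgo Network's actual mechanism: the function names, the median-based threshold, and the heartbeat timeout are all assumptions chosen for clarity.

```python
from statistics import median


def find_stragglers(step_seconds: dict[str, float], slowdown: float = 1.5) -> list[str]:
    """Return workers whose latest step took more than `slowdown` x the median.

    Hypothetical heuristic: in synchronous training every worker should finish
    a step in roughly the same time, so outliers against the median stand out.
    """
    typical = median(step_seconds.values())
    return sorted(w for w, t in step_seconds.items() if t > slowdown * typical)


def find_hung(last_heartbeat: dict[str, float], now: float, timeout: float = 60.0) -> list[str]:
    """Return workers that have not reported progress within `timeout` seconds.

    Hypothetical heuristic: a fail-stop worker stops sending heartbeats, so a
    stale timestamp localizes the fault to that instance.
    """
    return sorted(w for w, t in last_heartbeat.items() if now - t > timeout)


# Made-up sample data for four workers.
steps = {"w0": 1.0, "w1": 1.1, "w2": 0.9, "w3": 2.4}
beats = {"w0": 100.0, "w1": 30.0, "w2": 110.0, "w3": 95.0}

stragglers = find_stragglers(steps)   # ["w3"]
hung = find_hung(beats, now=120.0)    # ["w1"]
```

Once a hung worker is identified, a scheduler could isolate it and restart the job from the last checkpoint, which matches the recovery flow described above.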

To complement these capabilities, we also use high-resolution, sub-millisecond telemetry to identify elusive network micro-bursts that are usually missed by conventional 30-second monitoring intervals. These high-resolution telemetry advancements enable more efficient network operations, better provisioning, and a lower mean time to recovery.

Image 6: https://storage.googleapis.com/gweb-cloudblog-publish/images/maxresdefault-1_rh3wgyf.max-1300x1300.jpg

Video 6
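The following sketch shows why a 30-second average can miss a microburst entirely. It builds a synthetic trace with made-up numbers (a 400 Gbps link, a 40 Gbps baseline, and one 5 ms line-rate burst) and uses 1 ms samples for simplicity, whereas the telemetry described above is sub-millisecond. All names and values are illustrative assumptions.

```python
LINK_GBPS = 400.0     # hypothetical link speed
WINDOW_MS = 30_000    # one conventional 30-second monitoring interval


def synth_trace(baseline_gbps: float, burst_start_ms: int, burst_len_ms: int) -> list[float]:
    """Per-millisecond throughput: a flat baseline plus one line-rate burst."""
    trace = [baseline_gbps] * WINDOW_MS
    for ms in range(burst_start_ms, burst_start_ms + burst_len_ms):
        trace[ms] = LINK_GBPS
    return trace


def coarse_average(trace: list[float]) -> float:
    """What a single 30-second counter sample would report (Gbps)."""
    return sum(trace) / len(trace)


def find_bursts(trace: list[float], threshold: float = 0.9) -> list[tuple[int, int]]:
    """Return (start_ms, length_ms) runs where utilization >= threshold."""
    bursts, start = [], None
    for ms, rate in enumerate(trace):
        hot = rate / LINK_GBPS >= threshold
        if hot and start is None:
            start = ms
        elif not hot and start is not None:
            bursts.append((start, ms - start))
            start = None
    if start is not None:
        bursts.append((start, len(trace) - start))
    return bursts


trace = synth_trace(baseline_gbps=40.0, burst_start_ms=12_000, burst_len_ms=5)
avg = coarse_average(trace)   # about 40.06 Gbps, roughly 10% utilization
bursts = find_bursts(trace)   # [(12000, 5)]
```

The coarse average moves from 40.00 to about 40.06 Gbps, which looks like an idle link, while the fine-grained scan pinpoints a 5 ms interval at full line rate.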

**2. The fabric across AI Hypercomputer**

The exponential growth of modern AI workloads requires us to scale and distribute AI workloads across multiple campuses over a WAN. At the same time, traditional networks weren’t built for the high bandwidth and extreme burstiness of AI traffic, and often fail to detect microbursts that can lead to severe performance degradation. We have developed a suite of innovations to optimize WAN performance for cross-site AI deployments, including:

  • A multi-shard global network that enables horizontal scaling. Our global network sustained 10x growth in WAN traffic from 2020 to 2025.
  • Tuning the fabric for essential availability, latency, and quality of service (QoS) attributes. Real-time microburst management helps ensure fair bandwidth allocation and infrastructure isolation across our multi-tenant infrastructure.
  • Multi-shard isolation to ensure each network shard operates with its own control, data, and management planes.

Combined with regional isolation and Protective Reroute, this architecture minimizes failure impact and shortens user-visible outages — delivering the beyond-nines reliability essential for AI workloads.

Providing high-speed, flexible, and cost-effective interconnectivity is also a priority. AI training relies on vast datasets that are often located on-premises or across various clouds. Given the high cost of AI compute, minimizing idle time is essential; for instance, upgrading from a 100 Gbps link to a 3.2 Tbps connection reduces the time to transfer a petabyte of data from 22.2 hours to just 0.7 hours — a 97% reduction in AI compute idle time spent waiting for data. Our AI-native Cloud Interconnect is purpose-built for the high-bandwidth and low-latency needs of AI workloads, featuring an optimized data path with 400 Gbps links that scale in 3.2 Tbps increments to reach petabit-per-second capacity. It also offers traffic differentiation and flexible connection options, including direct fiber peering and colocation facilities. AI-native Cloud Interconnect supports petabit-scale data transfer with reliable, private connectivity necessary for your cross-cloud AI training and serving.
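The transfer-time figures above can be reproduced with a short calculation. This sketch assumes a decimal petabyte (10^15 bytes) and a fully utilized link with no protocol overhead, so real transfers would take somewhat longer. The function name is illustrative.

```python
PETABYTE_BYTES = 1e15  # decimal petabyte (assumption)


def transfer_hours(data_bytes: float, link_bps: float) -> float:
    """Ideal time to move `data_bytes` over a link of `link_bps` bits/sec."""
    return data_bytes * 8 / link_bps / 3600


slow_h = transfer_hours(PETABYTE_BYTES, 100e9)    # 100 Gbps link
fast_h = transfer_hours(PETABYTE_BYTES, 3.2e12)   # 3.2 Tbps connection
reduction = 1 - fast_h / slow_h

print(f"{slow_h:.1f} h -> {fast_h:.1f} h ({reduction:.0%} less waiting)")
# 22.2 h -> 0.7 h (97% less waiting)

# A 3.2 Tbps increment corresponds to eight 400 Gbps links.
links_per_increment = 3.2e12 / 400e9
```

The 32x increase in bandwidth cuts the wait to 1/32 of the original, which is where the roughly 97% reduction comes from.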

**3. A resilient global network for the age of inference**

Applications serving AI inference to a global user population or supporting an agentic enterprise are far more demanding than conventional web apps. The need for opportunistic use of expensive AI compute available at distant locations, distributed service dependencies, and the burstiness of the traffic together demand a high-bandwidth network with a global footprint, as well as deep peering with SaaS providers, ISPs, and hyperscalers. To maintain responsiveness and "always-on" availability, applications need low latency and a highly resilient network.

Image 7: https://storage.googleapis.com/gweb-cloudblog-publish/images/5_bogYf7C.max-1300x1300.jpg

With its connectivity, scale, and resilience, Google’s global network is well-equipped to handle the demands of the age of AI inference. Our network spans more than 10 million kilometers of terrestrial and subsea fiber, connects our 43 cloud regions, and features 200+ edge locations, providing the essential footprint for serving AI inference. Our Premium Tier network delivers the low latency and reliability needed for a consistent, high-quality global user experience. By optimizing traffic entry and exit points, the network significantly boosts application performance, with resilience at the core of this "always-on" infrastructure.

**Building the future, together**

As a Google Cloud customer, you get these network innovations built directly into your environment. Google’s network delivers the massive scale, capacity, reliability, and performance essential for your AI workloads.

The AI era demands more than just raw compute; it necessitates a robust network fabric to scale. Our vertically integrated AI technology stack — from silicon to software ecosystems — is powered by the AI Hypercomputer to accelerate your transformation and make AI helpful for everyone. Whether through our megascale fabric, resilient global network for inference, or AI-native Cloud Interconnect, we ensure your AI journey is efficient and reliable. We look forward to building this future with you.
