Code Orange: Fail Small is complete. The result is a stronger Cloudflare network
2026-05-01
8 min read

Over the past two and a bit quarters, we've undertaken an intensive engineering effort, internally code-named "Code Orange: Fail Small", focused on making Cloudflare's infrastructure more resilient, secure, and reliable for every customer.
Earlier this month, the Cloudflare team finished this work.
While improving resiliency will never be a “job done” and will always be a top priority across our development lifecycle, we have now completed the work that would have avoided the November 18, 2025 and December 5, 2025 global outages.
This work focused on several key areas: safer configuration changes, reducing the impact of failure, and revising our “break glass” procedures and incident management. We also introduced measures to prevent drift and regressions over time, and strengthened the way we communicate to our customers during an outage.
Here we explain in depth what we shipped, and what it means for you.
Safer configuration changes
**_What it means for you_**_: In most cases, Cloudflare internal configuration changes no longer reach our network instantly and are instead rolled out progressively with real-time health monitoring. This allows our observability tools to catch problems and revert issues before they affect your traffic._
In order to catch potentially dangerous deployments before they reach production, we've identified high-risk configuration pipelines, and built new tools to manage configuration changes better.
For products that run on our network processing customer traffic and receive configuration changes, we no longer deploy these changes instantly across the network. Instead, relevant teams have adopted a “health-mediated deployment” methodology, the same one we use when releasing software, for all configuration deployments. This includes, but is not limited to, the product teams that were directly affected by the incidents.
Central to this is a new internal component we call Snapstone, which we built to bring health-mediated deployment to configuration changes. Snapstone is a system that bundles configuration changes into a package and then allows gradual release of those changes under health-mediation principles. Before Snapstone, applying this methodology to config was possible but difficult: it required significant per-team effort and wasn't consistently applied across the network. Snapstone closes this gap by providing a unified way to bring progressive rollout, real-time health monitoring, and automated rollback to configuration deployments by default.
What makes Snapstone particularly powerful is its flexibility. Rather than being a fix for specific past failures, Snapstone allows teams to dynamically define any unit of configuration that needs health mediation, whether that's a data file like the one that caused the November 18 outage, or a control flag in our global configuration system like the one involved in the December 5 outage. Teams create these configuration units on demand, and Snapstone ensures they are deployed safely everywhere they're used.
This gives us something we didn't have before: when a risk review or operational experience identifies a dangerous configuration pattern, the fix is straightforward. Bring it into Snapstone, and the configuration pattern immediately inherits safe deployment.
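To make the idea concrete, here is a minimal sketch of a health-mediated rollout loop. This is not Snapstone's actual API: the `ConfigUnit` type, the `rollout()` function, and the stage fractions are illustrative assumptions.

```python
# Hypothetical sketch of health-mediated configuration rollout.
# ConfigUnit, rollout(), and STAGES are illustrative, not Snapstone's API.

STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of the network per wave


class ConfigUnit:
    """Any unit of configuration: a data file, a control flag, etc."""
    def __init__(self, name, payload):
        self.name = name
        self.payload = payload


def rollout(unit, apply_fn, health_fn, stages=STAGES):
    """Apply `unit` wave by wave; roll back on the first unhealthy signal.

    Returns (deployed_fraction, rolled_back).
    """
    deployed = 0.0
    for fraction in stages:
        apply_fn(unit, fraction)      # progressive release
        if not health_fn(fraction):   # real-time health monitoring
            apply_fn(unit, 0.0)       # automated rollback
            return deployed, True
        deployed = fraction
    return deployed, False
```

In this model, a bad configuration unit whose health checks fail at the first wave is rolled back after touching only 1% of the network, rather than being pushed everywhere at once.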
Reducing the impact of failure
**_What it means for you_**_: In the event an issue is observed on our network, our systems now fail more gracefully. This vastly reduces the potential impact radius, to ensure your traffic is delivered even in worst-case scenarios._
Product teams have carefully reviewed, both manually and programmatically, the potential failure modes of products that are critical for serving customer traffic. Teams have removed non-essential runtime dependencies and implemented better failure modes. Where possible, we will now use the last known good configuration (“fail stale”); where that isn’t possible, we have reviewed each failure case and implemented “fail open” or “fail close”, depending on whether serving traffic with reduced functionality is preferable to failing to serve traffic.
Let’s look at an example of how this works. Our November 2025 outage was triggered by a failed rollout of our Bot Management machine learning classifier. Under our new procedures, if data that our system could not read were generated again, the system would refuse the updated configuration and continue using the old one. If the old configuration were not available for some reason, it would fail open to ensure customer production traffic continues to be served, which is a much better outcome than downtime.
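That decision ladder can be sketched as follows. The `load_config` function and its arguments are hypothetical names for illustration, not Cloudflare's actual code.

```python
# Illustrative sketch of the fail-stale / fail-open ladder.
# load_config() and its arguments are assumptions for the example.

def load_config(new_raw, parse, last_known_good=None):
    """Prefer the fresh config. If it cannot be parsed, fall back to the
    last known good configuration ("fail stale"). If neither is usable,
    return None so the caller can fail open: keep serving traffic with
    the feature disabled rather than failing to serve traffic at all."""
    try:
        return parse(new_raw), "new"
    except ValueError:
        pass  # refuse the unreadable update instead of crashing
    if last_known_good is not None:
        return last_known_good, "stale"
    return None, "open"
```

In this sketch, an unreadable update yields the stale configuration (or an explicit fail-open signal), never a crash in the serving path.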
As a result, if the same Bot Management change that caused the failure in November were to roll out now, the system would detect the failure in an early stage of the deployment, before it had affected anything more than a small percentage of traffic.
We have also begun further segmenting our system so that independent copies of services run for different cohorts of traffic. Cloudflare already takes advantage of these customer cohorts for blast radius mitigation with traffic management techniques today, and this additional process segmentation work provides a powerful reliability capability for us going forward.
For example, the Workers runtime system is segmented into multiple independent services handling different cohorts of traffic, with one handling only traffic for our free customers. Changes are deployed to these segments based on customer cohorts, starting with free customers first. We’re also sending updates more quickly and frequently to the least critical segments, and at a slower pace to the most critical segments.
As a result, if a change were deployed to the Workers runtime system and it broke traffic, it would now only affect a small percentage of our free customers before being automatically detected and rolled back.
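A toy model of this cohort ordering might look like the following; the segment names and the least-critical-first ordering are illustrative assumptions, not our actual topology.

```python
# Toy model of cohort-segmented deployment.
# Segment names and ordering are illustrative assumptions.

SEGMENTS = ["free", "pro", "business", "enterprise"]


def deploy_by_cohort(healthy_on, segments=SEGMENTS):
    """Deploy to each cohort's independent service in order, stopping
    (and rolling back) as soon as that cohort's health checks fail.

    Returns (segments_reached, rolled_back)."""
    reached = []
    for segment in segments:
        reached.append(segment)
        if not healthy_on(segment):
            return reached, True  # breakage confined to this cohort
    return reached, False
```

Because independent copies of the service handle each cohort, a broken change in this model never reaches the later, more critical segments.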
Sticking with the Workers runtime system as an example, in a seven-day period earlier this month the deployment process was triggered more than 50 times. Each deployment happens in “waves” as the change propagates to the edge, often in parallel with the prior and following releases.

We’re working on extending this pattern of deployment to many more of our systems in the future.
Revised “break glass” and incident management procedures
**_What it means for you_**_: If an incident does occur, we have the tools and teams to communicate more clearly and resolve it faster, minimizing downtime._
Cloudflare runs on Cloudflare. We use our own Zero Trust products to secure our infrastructure, but this creates a dependency: if a network-wide outage impacts these tools, we lose the very pathways we need to fix them. Before this Code Orange initiative, our "break glass" pathways were restricted to a handful of people and offered limited tool access. We needed these tools and pathways to be more broadly available during an outage.
To solve this, we conducted a comprehensive audit of the tools essential for system visibility, debugging, and production changes. We ultimately developed backup authorization pathways for 18 key services, supported by new emergency scripts and proxies.
Throughout the Code Orange program, we moved from theory to practice. After small-team exercises, we conducted an engineering-wide drill on April 7, 2026, involving more than 200 team members. While automation keeps these pathways functional, drills like these ensure our engineers have the muscle memory to use them under pressure.
This effort also focused on the flow of information. When internal visibility is disrupted, our incident response slows down, and our ability to communicate with the outside world suffers. Historically, technical observations from the heat of the moment didn't always translate into clear updates for our customers.
To bridge this gap, we established a dedicated communications team to work in lockstep with incident responders during major events. Just as our engineers practiced their "break glass" procedures, this team used the Code Orange program to drill on streamlining the cadence and clarity of customer updates. By ensuring we have both the tools to see and the structure to speak, we can resolve incidents faster and keep our customers better informed.
We have codified our improvements
**_What it means for you_**_: We will remember the learnings from our incidents and have codified the resolutions. Our network will only become more resilient._
To avoid drift and reintroducing regressions to the work done as part of Code Orange over time, the team has built an internal Codex that solidifies all our guidelines in clear and concise rules.
The Codex is now mandatory for all engineering and product teams, and has become a central part of Cloudflare's internal procedures. Its rules are enforced via AI code reviews that automatically flag any code that might diverge from the guidelines and require additional manual review. This is applied without exception to our entire codebase. The goal is simple: build institutional memory that enforces itself.
The November and December outages shared a common failure mode: code that assumed inputs would always be valid, with no graceful degradation when that assumption broke. A Rust service called .unwrap() instead of handling an error; Lua code indexed an object that didn't exist. Both patterns are preventable if the lessons are captured and enforced.
The Codex is part of our answer. It's a living repository of engineering standards written by domain experts through our Request For Comments (RFC) process, then distilled into actionable rules. Best practices that previously lived in the heads of senior engineers, or were discovered only after an incident, now become shared knowledge accessible to everyone. Each rule follows a simple format: "If you need X, use Y" with a link to the RFC that explains why.
For example, one RFC now states: "Do not use `.unwrap()` outside of tests and `build.rs`." Another captures a broader principle: "Services MUST validate that upstream dependencies are in an expected state before processing."
Had these rules been enforced earlier, the November and December outages would have been rejected merge requests instead of global incidents.
Rules without enforcement are suggestions. The Codex integrates with AI-powered agents at every stage of the software development lifecycle, from design review through deployment to incident analysis. This shifts enforcement left, from "global outage" to "rejected merge request." The blast radius of a violation shrinks from millions of affected requests to a single developer getting actionable feedback before their code ever reaches production.
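As a simplified illustration of turning one Codex rule into an automated check: our real enforcement uses AI code review, so the regex pass below, including its deliberately crude handling of `#[cfg(test)]`, is only a sketch.

```python
# Minimal sketch of mechanically enforcing the Codex rule
# "Do not use .unwrap() outside of tests and build.rs".
# Illustrative only; real enforcement is AI-assisted code review.

import re

UNWRAP = re.compile(r"\.unwrap\(\)")


def check_unwrap_rule(path, source):
    """Return (path, line_no) pairs for each violation, skipping allowed files."""
    if path.endswith("build.rs") or "/tests/" in path:
        return []
    violations = []
    in_test_code = False
    for line_no, line in enumerate(source.splitlines(), 1):
        if "#[cfg(test)]" in line:
            in_test_code = True  # crude: skip everything after the test module begins
        if not in_test_code and UNWRAP.search(line):
            violations.append((path, line_no))
    return violations
```

A check like this turns a violation into actionable feedback on a merge request instead of a latent panic in production.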
The Codex is a living document and will be continuously improved over time. Domain experts write RFCs to codify best practices. Incidents surface gaps that become new RFCs. Every approved RFC generates Codex rules. Those rules feed the agents that review the next merge request. It's a flywheel: expertise becomes standards, standards become enforcement, enforcement raises the floor for everyone.
It’s not just about code: communication is key
**_What it means for you_**_: Transparency is important to us. If something goes wrong, we’re committed to keeping you updated every step of the way so you can stay focused on what matters to you._
The global outages have made us review core processes and cultural approaches even beyond engineering and product development. As part of the broader Code Orange initiatives, we have introduced additional service level objectives (SLOs) to all our services, enforced a global changelog, onboarded all teams to our maintenance coordination system, and improved transparency across the company on our incident “prevents” ticket backlog.
We have also strengthened the way we communicate to our customers during an outage. Our goal is to alert you to an issue the moment we confirm it, before you even notice a problem. By the time you notice a lag or an error, our aim is to have an update already waiting in your notifications.
During an active incident, we now provide updates at predictable intervals (e.g., every 30 or 60 minutes), even if the update is simply, "We are still testing the fix; no new changes yet." This allows you to plan your day rather than constantly refreshing a status page.
Our job isn't done when the status returns to normal. We provide detailed post-mortems explaining what happened, why it happened, and the specific structural changes we are making to ensure it doesn't happen again.
This initiative is complete. But our work on resiliency is never done.
We take the incidents very seriously and adopted a shared ownership across the entire Cloudflare organization by asking every team: What could have been done better? This guided the work that we carried out over the last two quarters.
While this work is never truly done, we are confident that we are in a much better position and Cloudflare is now much stronger because of it.
Cloudflare's connectivity cloud protects entire corporate networks, helps customers build Internet-scale applications efficiently, accelerates any website or Internet application, wards off DDoS attacks, keeps hackers at bay, and can help you on your journey to Zero Trust.
Visit 1.1.1.1 from any device to get started with our free app that makes your Internet faster and safer.
To learn more about our mission to help build a better Internet, start here. If you're looking for a new career direction, check out our open positions.