Augment Code on X: "Scaling incident management for an AI-native organization using Cosmos"
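TL;DR · AI summary
Augment uses its Cosmos platform to automate most of incident response inside Slack, cutting human on-call investigation work by 81%.
Key points
- Automating incident response cut human on-call investigation work by 81%.
- The Incident Investigator runs in Slack, automatically performs triage and root cause analysis, and recommends a remediation action.
- The Cosmos platform provides a structured expert system covering the full workflow from triage to post-resolution.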
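Outline
- §Introduction
In AI-native organizations, as coding and code review become automated, the next major bottleneck is incident management.
- ·Background
On-call engineers struggle to contribute to feature development during their rotation, and in small, fast-moving teams the operational load directly slows engineering velocity.
- §Solution
Use the Augment Cosmos platform to automate incident response and reduce human on-call investigation work.
Cosmos is an operating system purpose-built for automating engineering workflows, with long-running experts that collaborate across the SDLC.
The expert runs in Slack, automatically performs triage and root cause analysis, and recommends a remediation action.
- §Results
Automation cut human on-call investigation work by 81% and improved engineering efficiency.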
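Mind map
- Automated incident response
  - 81% reduction in human on-call investigation work
- Incident Investigator expert
  - Runs in Slack
  - Automatically performs triage and root cause analysis
  - Recommends a remediation action
- Cosmos platform
  - Purpose-built for automating engineering workflows
  - Supports long-running experts that collaborate across the SDLC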
As coding and code review become increasingly automated in AI-native organizations, one of the next major bottlenecks is incident management. On-call engineers are often unable to contribute meaningfully to feature development during their rotation, and in small, fast-moving teams, the operational load becomes a direct drag on engineering velocity.
Even with modern observability tooling, incident response still requires engineers to reconstruct operational context under pressure: correlating deploys, searching Slack threads, checking dashboards, identifying ownership, and rediscovering previous incidents. When an alert fires, engineers typically jump between PagerDuty, Slack, dashboards, logs, metrics, GitHub, and prior incidents trying to answer a few core questions:
- What actually broke?
- Did a recent deploy cause this?
- Who owns the affected service?
- Is this a known issue?
- What should we do next?
Most of this is repetitive investigative work.
Our goal is to have agents drive the repetitive investigation of incident management while pulling humans in primarily for judgment, prioritization, and remediation decisions.
This article covers how we use the Augment Cosmos platform to automate large parts of incident response directly inside Slack, resulting in an 81% reduction in human on-call investigation effort.
Cosmos
Earlier this year, we rolled out Cosmos internally: our operating system for agentic software development. Cosmos is purpose-built for automating engineering workflows with long-running experts that can work across your SDLC, collaborate with humans, connect to your tools, and continuously improve over time. Each Cosmos automation comes in the form of an Expert, which has its own prompt, integrations, environment, secrets, event triggers, subscriptions, worker experts, and more. This post focuses on a Cosmos Expert for incident management.
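To make the parts of an Expert concrete, here is a minimal sketch of how such a definition might be modeled. The post does not show Cosmos's actual configuration format, so the class, every field name, and all example values below are hypothetical stand-ins for the components listed above.

```python
from dataclasses import dataclass, field


@dataclass
class ExpertSpec:
    """Hypothetical model of a Cosmos Expert; not the real Cosmos schema."""
    name: str
    prompt: str                                               # operating instructions
    integrations: list[str] = field(default_factory=list)     # tools it connects to
    environment: str = "default-vm"                           # where its tools run
    secrets: list[str] = field(default_factory=list)          # secret names, never values
    event_triggers: list[str] = field(default_factory=list)   # events that start a run
    subscriptions: list[str] = field(default_factory=list)    # channels or feeds it follows
    worker_experts: list[str] = field(default_factory=list)   # experts it can delegate to


# Illustrative values only; none of these identifiers come from the post.
incident_investigator = ExpertSpec(
    name="incident-investigator",
    prompt="Triage each alert, investigate, and recommend a remediation.",
    integrations=["slack", "pagerduty", "github"],
    secrets=["GCP_ACCESS_TOKEN"],
    event_triggers=["pagerduty.alert.triggered"],
    subscriptions=["#oncall-alerts"],
    worker_experts=["pr-author"],
)
```

Keeping secrets as names rather than values mirrors the idea that credentials are managed by the platform and not embedded in the expert definition.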
After
, Augment engineers started moving significantly faster on feature development. But on-call was still a major operational bottleneck.
In practice, on-call rotations consistently pulled engineers away from roadmap work for alert triage and incident investigation. This was especially painful in our small, fast-moving teams (typically 2–5 engineers), where newer on-call engineers often escalated incidents to senior engineers for additional context or validation.
Even when alerts turned out to be transient failures, noisy false positives, or incidents that auto-resolved quickly, the investigation still consumed substantial engineering time. A typical engineer spent 30 minutes actively triaging an incident, and newer engineers took even longer. On-call engineers were interrupted 5 times per day, and senior engineers were pulled into 20% of alerts.
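As a rough back-of-the-envelope reading of these figures, the sketch below multiplies them out. It assumes, purely for illustration, that every interruption turns into one full 30-minute triage and that every interruption is an alert; the post does not state either, so treat the result as an upper-bound estimate and not a measured number.

```python
# Figures quoted in the paragraph above.
TRIAGE_MINUTES_PER_INCIDENT = 30
INTERRUPTIONS_PER_DAY = 5
SENIOR_ESCALATION_RATE = 0.20

# Assumption (not from the post): each interruption costs one full triage.
triage_minutes_per_day = TRIAGE_MINUTES_PER_INCIDENT * INTERRUPTIONS_PER_DAY
triage_hours_per_day = triage_minutes_per_day / 60

# Assumption (not from the post): every interruption is an alert.
senior_escalations_per_day = INTERRUPTIONS_PER_DAY * SENIOR_ESCALATION_RATE

print(f"{triage_hours_per_day:.1f} hours/day of triage")
print(f"{senior_escalations_per_day:.1f} senior escalations/day")
```

Under those assumptions, triage alone could take up to 2.5 hours of an on-call engineer's day, with roughly one alert per day also pulling in a senior engineer.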
That was the bottleneck we wanted to remove: not human judgment, but the repetitive investigation required before humans could make good decisions.
Our primary operational expert is the Incident Investigator. Slack is the natural surface for this expert because it has effectively become the operational control plane for incident response.
The Incident Investigator runs triage and root cause analysis on every alert, then routes to one of four remediation paths. Humans step in to review the RCA, ask follow-up questions, and approve the action.
The Incident Investigator operates directly inside incident Slack channels. It reacts to PagerDuty alerts, performs structured investigations, and posts an initial root cause analysis (RCA) and a recommended remediation action in-thread (code-fix, rollback, escalate, or monitor only) before a human has even looked at the alert. A human scans the RCA, optionally asks follow-up questions, and makes a judgment call on whether the remediation action is appropriate. This is the only point where a human is involved, and for the average alert it takes less than a minute. When a code change is required, the Incident Investigator hands off the code-fix to a PR Author expert.
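The four remediation paths and the human approval gate described above can be sketched as a small routing function. This is an illustration under stated assumptions, not Cosmos's implementation: the `Remediation` enum, the `Recommendation` type, and the returned step names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Remediation(Enum):
    CODE_FIX = "code-fix"
    ROLLBACK = "rollback"
    ESCALATE = "escalate"
    MONITOR = "monitor"


@dataclass
class Recommendation:
    alert_id: str
    rca_summary: str
    action: Remediation


def next_step(rec: Recommendation, human_approved: bool) -> str:
    """Return the follow-up for a recommendation (hypothetical step names)."""
    if not human_approved:
        # Nothing production-impacting happens without a human sign-off.
        return "await-human-review"
    if rec.action is Remediation.CODE_FIX:
        return "handoff:pr-author"   # code changes go to the PR Author expert
    if rec.action is Remediation.ROLLBACK:
        return "rollback-approved"
    if rec.action is Remediation.ESCALATE:
        return "page-service-owner"
    return "keep-monitoring"
```

For example, `next_step(Recommendation("a1", "bad deploy", Remediation.CODE_FIX), human_approved=True)` returns `"handoff:pr-author"`, while the same call with `human_approved=False` returns `"await-human-review"`.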
The investigator’s report is highly structured. The expert follows a fixed operational workflow covering triage, investigation, communication, remediation recommendation, and post-resolution summarization. The prompt defines operational procedures, escalation rules, scope boundaries, and constraints, while the LLM handles evidence gathering, hypothesis refinement, validation, and communication. In practice, this produces significantly more reliable operational behavior than either free-form agent loops or human on-call engineers.
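One way to picture a fixed workflow is as an ordered set of stages that cannot be skipped. The sketch below assumes the five stages run strictly in the order listed, which the post implies but does not state; the `Stage` and `Workflow` names are hypothetical.

```python
from enum import IntEnum


class Stage(IntEnum):
    TRIAGE = 1
    INVESTIGATION = 2
    COMMUNICATION = 3
    REMEDIATION_RECOMMENDATION = 4
    POST_RESOLUTION_SUMMARY = 5


class Workflow:
    """Tracks progress and rejects out-of-order stage transitions."""

    def __init__(self) -> None:
        self.completed: list[Stage] = []

    def complete(self, stage: Stage) -> None:
        # The next allowed stage is the one right after the last completed.
        # Stage(...) itself raises ValueError once all five stages are done.
        expected = Stage(len(self.completed) + 1)
        if stage is not expected:
            raise ValueError(f"expected {expected.name}, got {stage.name}")
        self.completed.append(stage)

    @property
    def done(self) -> bool:
        return len(self.completed) == len(Stage)
```

Encoding the order in code instead of leaving it to the model is one plausible reason a fixed workflow behaves more predictably than a free-form agent loop.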
Below is an example of the initial RCA and recommended fix:
Engineers sometimes continue the Slack thread with follow-up questions/requests for the Incident Investigator such as:
- “I think this alert will likely auto-resolve. Is that correct?”
- “does this affect tenants other than X?”
- “is this related to the previous alert from 30 minutes ago?”
- “what if we tune the threshold of the alert to 2 failures per hour?”
Finally, the Incident Investigator posts a resolution summary, captures key learnings in memory, and optionally writes a post-mortem report.
This creates a broader operational loop where agents:
- investigate failure root cause and impact
- propose remediations
- answer questions
- generate fixes
- write resolution summaries and post-mortems
- record learnings from human interaction
while humans remain responsible for judgment calls and production-impacting decisions.
Generating high quality RCAs
The most important piece in getting the Incident Investigator to generate high quality analyses (typically higher quality than an average developer's) is to give it everything an on-call engineer would have access to.
Tools
The Incident Investigator gathers evidence from logs, metrics, recent deploys, GitHub history, code context, ownership mappings, and recent alerts in the same channel. We therefore need to ensure that the Cosmos environment (i.e., the VM) has the required tools installed to access logs and metrics (e.g., the gcloud CLI) and that authentication (e.g., access tokens) is set up via Cosmos secrets.
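A simple preflight check can confirm that the environment is ready before the expert starts investigating. The sketch below is illustrative: it assumes secrets are exposed to the VM as environment variables, and the variable name `GCP_ACCESS_TOKEN` is hypothetical. The post does not describe how Cosmos actually injects secrets.

```python
import os
import shutil

REQUIRED_TOOLS = ["gcloud", "kubectl"]      # CLIs the expert shells out to
REQUIRED_SECRETS = ["GCP_ACCESS_TOKEN"]     # hypothetical env var name


def preflight(tools=REQUIRED_TOOLS, secrets=REQUIRED_SECRETS, env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"missing tool: {tool}")
    for name in secrets:
        if not env.get(name):
            problems.append(f"missing secret: {name}")
    return problems
```

Running `preflight()` at startup and failing loudly on a non-empty result avoids investigations that silently return no evidence because a CLI or credential is absent.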
Context
Now that your Incident Investigator can access logs, metrics, etc., it also needs to know what kinds of queries to make. We use Agent Skills for logs and metrics analysis. These skills are internal to Augment today. They live in our repo and define:
- how to query observability systems
- operational constraints
- common query patterns
- environment mappings
- debugging workflows
For example, our metrics skill wraps GCP Managed Prometheus and exposes structured PromQL querying patterns for request rates, error rates, pod restarts, and deployment health. Similarly, our logs skill wraps GCP Cloud Logging and kubectl workflows for querying production and staging logs, correlating events across pods, and reconstructing incident timelines. This allows you to customize investigation behavior, add organization-specific workflows, or swap in alternative observability stacks entirely. A common pattern is also ‘Runbooks as code’ (i.e. incident runbooks stored in the codebase), and some teams at Augment use them for team-specific behaviors.
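To illustrate what "structured PromQL querying patterns" might look like, the sketch below builds two query strings. It is not Augment's internal skill. The metric `http_requests_total` and its `service` and `code` labels are assumed naming conventions; `kube_pod_container_status_restarts_total` is the restart counter exported by kube-state-metrics, which is assumed to be deployed.

```python
def error_rate(service: str, window: str = "5m") -> str:
    """PromQL for the fraction of requests returning 5xx over the window."""
    sel = f'service="{service}"'
    return (
        f'sum(rate(http_requests_total{{{sel},code=~"5.."}}[{window}]))'
        f' / sum(rate(http_requests_total{{{sel}}}[{window}]))'
    )


def pod_restarts(namespace: str, window: str = "1h") -> str:
    """PromQL for container restarts per pod over the window."""
    return (
        "sum by (pod) (increase("
        f'kube_pod_container_status_restarts_total{{namespace="{namespace}"}}[{window}]))'
    )


print(error_rate("api"))
print(pod_restarts("prod"))
```

Wrapping queries in small named functions like these gives the agent a vetted vocabulary of queries, so it does not have to compose PromQL from scratch during an incident.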
Memory
The final piece that really gets the Incident Investigator operating at the level of your senior engineers is memory. As the expert interacts with humans, it records important tribal knowledge that isn’t documented anywhere, and over time this fills in the knowledge gaps that humans missed recording in Agent Skills or Runbooks (because, let’s be honest, many teams drop the ball on documentation).
Another benefit of adopting Cosmos for your SDLC: Memory is shared between Incident Investigators and other experts (such as Code Review), meaning that learnings from one will propagate to the other, resulting in higher overall software quality.
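The idea of memory shared across experts can be sketched with a tiny in-process store. This is a toy stand-in: the post does not describe how Cosmos memory is stored or queried, and the `Learning` and `SharedMemory` types, the `checkout-api` topic, and the example note are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Learning:
    topic: str    # e.g. a service or alert name
    note: str     # tribal knowledge captured from a human
    source: str   # which expert recorded it


class SharedMemory:
    """In-process stand-in for a memory store shared across experts."""

    def __init__(self) -> None:
        self._by_topic = defaultdict(list)

    def record(self, learning: Learning) -> None:
        self._by_topic[learning.topic].append(learning)

    def recall(self, topic: str) -> list:
        # .get avoids creating empty entries for unknown topics.
        return list(self._by_topic.get(topic, []))


memory = SharedMemory()
memory.record(Learning(
    topic="checkout-api",
    note="Latency alerts during nightly batch jobs are expected.",
    source="incident-investigator",
))
# A different expert (e.g. code review) reads what the investigator learned.
notes_for_review = memory.recall("checkout-api")
```

Because both experts read from the same store, a learning recorded during an incident is available the next time related code is reviewed.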
Operational Impact
We analyzed our incident response data for a month before and after deploying the Incident Investigator across five on-call channels, and we summarize the effect on two key metrics: reduction in developer effort on incident response and time to resolution.
On-call developer effort for incident response:
The shift in who does the incident response is the most striking:
Before Cosmos, nearly every incident had a human doing the initial investigation and coming up with remediation actions (some pulled in interactive coding agents to help investigate different facets, but most of the analysis was done by humans). After, fewer than one in five did: an ~81% reduction in human on-call work. All this freed developer time results in faster velocity: our data shows a 44% increase in merged PRs/week for on-call engineers.
Faster RCA means faster resolution:
After deploying Cosmos, median time-to-first-RCA fell from 30.1 minutes to 6.2 minutes. The bot can begin working immediately, whereas humans incur context-switching latency. This significantly shortens the time to a final resolution: median time to resolution (MTTR) dropped from 29.5 to 19.9 minutes. (MTTR can be lower than time-to-first-RCA because alerts on transient errors can auto-resolve before an RCA is complete.)
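For readers who prefer relative numbers, the quoted medians can be converted into percentage reductions. The sketch below only restates the figures from the paragraph above; the helper name `percent_reduction` is ours.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Medians quoted in the paragraph above, in minutes.
rca_drop = percent_reduction(30.1, 6.2)
mttr_drop = percent_reduction(29.5, 19.9)

print(f"time-to-first-RCA: {rca_drop:.1f}% lower")   # 79.4% lower
print(f"MTTR: {mttr_drop:.1f}% lower")               # 32.5% lower
```

Time-to-first-RCA falls by roughly four fifths while MTTR falls by about a third, which is consistent with resolution time also depending on steps that the RCA does not speed up.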
What the metrics don’t capture:
It is difficult to quantify RCA quality, but internal surveys show that on average the Incident Investigator’s RCA is correct more often than an on-call engineer’s. This is because it consistently executes every step of an incident runbook and performs a deep analysis on every single alert, while a human engineer’s RCA quality varies drastically across individuals.
One of the goals behind Cosmos is making out-of-the-box workflows like incident management easy to adopt, so that users don’t have to reinvent the wheel of agent orchestration or hill-climb on agent quality.
The same workflow we use internally can be created directly through the Cosmos Advisor with prompts like:
“Set up an incident investigator expert for me.”
While our production setup uses Slack, PagerDuty, GCP Cloud Logging, Managed Prometheus, and GitHub, the architecture itself is intentionally modular. Different observability stacks can be swapped in while preserving the same overall operational workflow, and the Cosmos Advisor will guide you through customizing the expert for your observability stack.
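The modularity described here is commonly achieved by having investigation steps depend on a small interface and not on a specific vendor. The sketch below shows that pattern with a toy in-memory backend; the `LogBackend` protocol and its `search` signature are hypothetical and do not describe Cosmos's actual extension points.

```python
from typing import Protocol


class LogBackend(Protocol):
    """Anything that can search recent logs for a service."""

    def search(self, service: str, query: str, minutes: int) -> list[str]: ...


class InMemoryLogs:
    """Toy backend for tests; a real one would call a logging API."""

    def __init__(self, lines: dict[str, list[str]]) -> None:
        self._lines = lines

    def search(self, service: str, query: str, minutes: int) -> list[str]:
        # `minutes` is ignored here because the toy data has no timestamps.
        return [line for line in self._lines.get(service, []) if query in line]


def gather_error_evidence(logs: LogBackend, service: str) -> list[str]:
    """Investigation step that depends only on the interface, not the vendor."""
    return logs.search(service, "ERROR", minutes=30)


backend = InMemoryLogs({"api": ["INFO ok", "ERROR upstream timeout"]})
print(gather_error_evidence(backend, "api"))
```

Swapping observability stacks then means writing another class with the same `search` method, while `gather_error_evidence` and the rest of the workflow stay unchanged.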
The important part is not the specific vendor tooling. It is building operational experts that can gather evidence systematically, operate within bounded scope, preserve context, and lead the incident response, while pulling in humans for judgement calls.
Authors: Akshay Utture, Sam Chow, and Sophie Reynolds