
ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?


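TL;DR · AI Summary

ExploitGym is a new benchmark of 898 real-world vulnerabilities that shows how AI hackers can turn known software vulnerabilities into working attacks.

Key Points

  • Anthropic's Claude Mythos Preview successfully exploited 157 vulnerability instances.
  • OpenAI's GPT-5.5 successfully exploited 120 vulnerability instances within the time limit.
  • Even with standard defenses such as ASLR and the V8 sandbox enabled, some attacks still succeeded.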

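Outline

  1. Introduces ExploitGym's background and the research team.

  2. Defines what AI hackers can do in vulnerability exploitation.

  3. Describes ExploitGym's experimental setup and evaluation criteria.

  4. Presents how AI hackers perform on ExploitGym.

  5. Analyzes what the results mean for cybersecurity.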



Center for Responsible, Decentralized Intelligence at Berkeley


Image 2: ExploitGym

ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?

[Zhun Wang](mailto:zhun.wang@berkeley.edu)1, [Nico Schiller](mailto:nico.schiller@mpi-sp.org)2, [Hongwei Li](mailto:hongwei@ucsb.edu)3, [Srijiith Sesha Narayana](mailto:srijiith.sesha-narayana@mpi-sp.org)2, Milad Nasr 5, Nicholas Carlini 5, Xiangyu Qi 6, Eric Wallace 6, Elie Bursztein 7, Luca Invernizzi 7, Kurt Thomas 7, Yan Shoshitaishvili 4, Wenbo Guo 3, Jingxuan He 1, Thorsten Holz 2, Dawn Song 1

1 UC Berkeley, 2 Max Planck Institute for Security and Privacy, 3 UC Santa Barbara, 4 Arizona State University,

5 Anthropic, 6 OpenAI, 7 Google

May 13, 2026

_(Est. 5-6 minutes read, more details in paper)_

We are a team of researchers led by Berkeley RDI at UC Berkeley, in collaboration with the Max Planck Institute for Security and Privacy, UC Santa Barbara, and Arizona State University, with support from Anthropic, OpenAI, and Google. Together, we have been working on a question the security community has been nervously circling:

_How good are today’s AI agents at turning known software vulnerabilities into working exploits, i.e., real attacks?_

This is one of the most critical questions for measuring the impact of frontier AI on cybersecurity, particularly on the offensive side.

**TL;DR**

ExploitGym is a new benchmark of 898 real-world vulnerabilities spanning userspace programs, Google’s V8 JavaScript engine (the engine behind Chrome), and the Linux kernel. Given a vulnerability and a proof-of-concept input that triggers it, AI agents are tasked with analyzing the vulnerability and crafting a full exploit that achieves unauthorized code execution.

The headline results: Anthropic’s Claude Mythos Preview successfully exploited 157 of those 898 instances, and OpenAI’s GPT-5.5 exploited 120, within the time limit per task. Even when standard security defenses like ASLR or the V8 sandbox were turned on, a meaningful number of exploits still worked. More strikingly, agents sometimes discovered and exploited entirely different vulnerabilities than the ones they were pointed at.

**Key Takeaways**

Autonomous exploitation is no longer hypothetical. Frontier AI agents can already take a bug report and a crashing input, reason about memory layouts, chain together multiple attack primitives, and produce fully working exploits. This kind of multi-step, low-level work has traditionally required deep expertise and significant time investment from human security researchers.

Standard defenses help, but don’t fully stop AI-driven attacks. When mitigations like ASLR, stack canaries, and the V8 heap sandbox were enabled, successes dropped substantially, but didn’t hit zero. Agents found bypasses: partial-pointer overwrites to defeat ASLR, known sandbox-escape techniques for V8, and kernel tricks such as overwriting `modprobe_path` and using side channels to sidestep KASLR. This is a clear signal that defense-in-depth remains essential, but current mitigations alone are not enough against AI-capable adversaries.

This is inherently dual-use. Exploitation sits at the heart of a fundamental tension in cybersecurity, and that tension is exactly why we built this benchmark. For defenders, it’s about determining whether a vulnerability actually matters in practice. Automated exploit generation could accelerate severity triage, help prioritize patches, and validate whether mitigations actually work. But the same capability lowers the expertise barrier for offensive misuse, making tasks that once required years of specialization accessible to far more actors. Sophisticated attackers could also adapt partial agent-generated trajectories into functioning exploits, using AI as a force multiplier.

As agents grow more capable, this asymmetry will intensify, and the window for proactive governance is narrowing. We believe the responsible path is to measure these capabilities rigorously and openly, so defenders, AI developers, and policymakers can make informed decisions. Our benchmark and results are intended as a foundation for those multi-stakeholder discussions.

**What Is ExploitGym?**

Most existing cybersecurity benchmarks for AI focus on tasks such as finding bugs, writing patches, or solving Capture-the-Flag (CTF) puzzles. Our earlier benchmark, CyberGym, focuses on real-world vulnerability analysis: given a vulnerability description and a codebase, agents must generate proof-of-concept inputs that trigger the bug. That is an important step in understanding whether AI can find or confirm vulnerabilities, but it stops short of the next question: can an agent turn a known bug into a real-world attack?

ExploitGym fills that gap. Each of its 898 tasks provides the agent with three things: the vulnerable source code with build instructions, a proof-of-vulnerability (PoV) input that triggers the bug, and a containerized runtime environment where the agent can interact with the target. The agent’s job is to transform that PoV into a working exploit that achieves unauthorized code execution, concretely, reading a secret flag that is inaccessible through any legitimate interface.

Image 3: Overview of ExploitGym

_Figure 1: Overview of ExploitGym._

The benchmark spans three domains. Userspace programs (520 instances) cover widely used C/C++ projects like FFmpeg and OpenSSL, sourced from Google’s OSS-Fuzz and OSV reports. V8 browser engine tasks (185 instances) target JavaScript engine bugs in Chromium. Linux kernel tasks (193 instances) require full-privilege escalation inside a virtual machine.

Each domain also comes with toggleable security mitigations, so researchers can measure exactly how much standard defenses like ASLR or the V8 sandbox actually slow down an AI attacker.

In addition to validating unauthorized code execution through flag capture, we use an agent-as-a-judge to verify that each exploit actually targets the provided vulnerability. Real-world software often contains multiple flaws, and agents frequently succeed by exploiting a different bug than the intended one. The judge ensures a consistent metric for cross-agent comparison and aligns with the core defensive use case: assessing the real-world severity of a specific known flaw.
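The two-stage success criterion described above (flag capture plus a judge check on the intended vulnerability) can be sketched as follows. This is a minimal illustration and not the benchmark's actual harness: `RunResult`, its field names, `is_success`, and `summarize` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of one agent run on one task (hypothetical structure)."""
    flag_captured: bool        # the agent read the secret flag
    judge_says_intended: bool  # judge verdict: exploit targets the provided bug


def is_success(result: RunResult) -> bool:
    """Count a run only if the flag was captured via the intended vulnerability."""
    return result.flag_captured and result.judge_says_intended


def summarize(results: list[RunResult]) -> tuple[int, int]:
    """Return (flag captures, intended-vulnerability successes)."""
    flags = sum(r.flag_captured for r in results)
    successes = sum(is_success(r) for r in results)
    return flags, successes


runs = [
    RunResult(flag_captured=True, judge_says_intended=True),
    RunResult(flag_captured=True, judge_says_intended=False),  # unintended bug
    RunResult(flag_captured=False, judge_says_intended=False),
]
print(summarize(runs))
```

Keeping the two counts separate is what allows flag captures and intended-vulnerability successes to be reported side by side.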

**The Main Results**

We tested seven model-agent combinations, all of which ran with safety filters disabled under structured access programs designed for security research. Each agent got two hours per task.

The top performers were Claude Mythos Preview (paired with Claude Code) at 157 successes and GPT-5.5 (paired with Codex CLI) at 120. GPT-5.4 came in at a respectable 54. After that, success counts dropped sharply: Claude Opus 4.6 solved 15, Gemini 3.1 Pro solved 12, and the remaining models were in the single digits.

Breaking results down by task domain reveals a pronounced difficulty gradient. Userspace tasks saw the broadest success. V8 exploitation was substantially harder, with only the top three models making real headway. Kernel exploitation was the sharpest dividing line: only Claude Mythos Preview (12 successes) and GPT-5.5 (22 successes) showed meaningful capability, while no other model managed more than one.

| Model | Agent | Success (Total) | Success (U) | Success (B) | Success (K) | Cost (USD), Succ. | Cost (USD), Full | Time (min), Succ. | Time (min), Full | LLM Calls, Succ. | LLM Calls, Full |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Mythos Preview† | Claude Code | 157 | 107 | 38 | 12 | — | — | 54.7 | 102.1 | 225.5 | 289.3 |
| Claude Opus 4.6† | Claude Code | 15 | 12 | 2 | 1 | 8.08 | 21.76 | 18.1 | 66.7 | 102.3 | 285.9 |
| Claude Opus 4.7 | Claude Code | 7 | 4 | 3 | 0 | 8.64 | 3.40 | 22.1 | 14.4 | 102.0 | 54.0 |
| Gemini 3.1 Pro | Gemini CLI | 12 | 10 | 2 | 0 | 8.56 | 9.02 | 51.1 | 75.6 | 169.5 | 174.8 |
| GLM-5.1 | Claude Code | 4 | 4 | 0 | 0 | 3.75 | 6.39 | 63.3 | 118.0 | 148.6 | 245.6 |
| GPT-5.4 | Codex CLI | 54 | 38 | 15 | 1 | 12.20 | 25.43 | 51.1 | 103.5 | 220.1 | 443.8 |
| GPT-5.5‡ | Codex CLI | 120 | 71 | 27 | 22 | 22.99 | 34.55 | 49.6 | 69.8 | 256.8 | 375.4 |

_Table 1: Agent performance under a two-hour timeout. Success is split by domain: userspace (U), V8 (B), and kernel (K). Cost, time, and LLM calls are per-task averages over successful runs (Succ.) and the full benchmark (Full).

† Results obtained in collaboration with Anthropic.

‡ OpenAI's default safety filters block all GPT-5.5 exploit attempts under default prompting._
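As a reading aid, the success columns of Table 1 can be cross-checked and turned into overall success rates with a few lines of Python. The numbers are copied from Table 1 and from the domain sizes stated earlier (520 userspace, 185 V8, 193 kernel); the variable and function names are our own, and this is only a sketch of the arithmetic.

```python
BENCHMARK_SIZE = 898
DOMAIN_SIZES = {"U": 520, "B": 185, "K": 193}  # userspace, V8, kernel

# (total, userspace, V8, kernel) successes, copied from Table 1
SUCCESSES = {
    "Claude Mythos Preview": (157, 107, 38, 12),
    "Claude Opus 4.6": (15, 12, 2, 1),
    "Claude Opus 4.7": (7, 4, 3, 0),
    "Gemini 3.1 Pro": (12, 10, 2, 0),
    "GLM-5.1": (4, 4, 0, 0),
    "GPT-5.4": (54, 38, 15, 1),
    "GPT-5.5": (120, 71, 27, 22),
}


def overall_rate(total: int) -> float:
    """Share of the full benchmark solved, in percent."""
    return 100 * total / BENCHMARK_SIZE


# The three domains should add up to the full benchmark.
assert sum(DOMAIN_SIZES.values()) == BENCHMARK_SIZE

for model, (total, u, b, k) in SUCCESSES.items():
    # Per-domain successes should add up to the reported total.
    assert u + b + k == total, model
    print(f"{model}: {total}/{BENCHMARK_SIZE} = {overall_rate(total):.1f}%")
```

With these inputs, the two strongest agents come out at roughly 17.5% and 13.4% of the benchmark under the two-hour budget.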

When standard mitigations were turned back on, success rates dropped across the board, but didn’t vanish. Claude Mythos Preview still succeeded on 25 userspace, 17 V8, and 3 kernel tasks with defenses active. GPT-5.5 retained 10, 3, and 8, respectively.

| Model | Userspace | V8 | Kernel |
| --- | --- | --- | --- |
| Claude Opus 4.6 | 12 → 0 | 2 → 0 | 1 → 0 |
| Claude Opus 4.7 | 4 → 0 | 3 → 0 | 0 → 0 |
| Claude Mythos Preview | 107 → 25 | 38 → 17 | 12 → 3 |
| Gemini 3.1 Pro | 10 → 0 | 2 → 0 | 0 → 0 |
| GLM-5.1 | 4 → 0 | 0 → 0 | 0 → 0 |
| GPT-5.4 | 38 → 2 | 15 → 0 | 1 → 1 |
| GPT-5.5 | 71 → 10 | 27 → 3 | 22 → 8 |

_Table 2: Mitigation-bypassing exploits. Each cell shows successes without mitigations → with mitigations enabled (ASLR, stack canaries, V8 heap sandbox, KASLR, etc.)._
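One way to read Table 2 is as a retention ratio: the fraction of a model's unmitigated successes that still work with defenses enabled. The sketch below computes it for the three models with any surviving exploits. The "retention" metric and all names are our own framing for illustration and are not a metric defined by the benchmark.

```python
from typing import Optional

# (successes without mitigations, successes with mitigations), from Table 2
TABLE2 = {
    "Claude Mythos Preview": {"Userspace": (107, 25), "V8": (38, 17), "Kernel": (12, 3)},
    "GPT-5.4": {"Userspace": (38, 2), "V8": (15, 0), "Kernel": (1, 1)},
    "GPT-5.5": {"Userspace": (71, 10), "V8": (27, 3), "Kernel": (22, 8)},
}


def retention(without: int, with_mitigations: int) -> Optional[float]:
    """Fraction of unmitigated successes that survive; None if there were none."""
    if without == 0:
        return None
    return with_mitigations / without


def total_retention(domains: dict[str, tuple[int, int]]) -> Optional[float]:
    """Retention aggregated over all domains for one model."""
    without = sum(w for w, _ in domains.values())
    with_m = sum(m for _, m in domains.values())
    return retention(without, with_m)


for model, domains in TABLE2.items():
    parts = ", ".join(
        f"{name} {retention(w, m):.1%}" for name, (w, m) in domains.items()
    )
    print(f"{model}: {parts}; overall {total_retention(domains):.1%}")
```

Small denominators, such as GPT-5.4's single kernel success, make individual ratios noisy, so the aggregated figure is usually the more meaningful one.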

**The Interesting Bits**

Agents go off-script and find new bugs. One of the most interesting findings is the gap between “captured the flag” and “exploited the intended vulnerability.” Across models, agents frequently achieved code execution through a vulnerability other than the one we provided. The two strongest models show this most clearly: GPT-5.5 captured flags in 210 instances, but only 120 used the intended bug, and Claude Mythos Preview captured 226 flags, but only 157 targeted the right flaw. In other words, 90 and 69 of their flag captures, respectively, succeeded via an unintended path. In some cases, agents pivoted to an adjacent code path with weaker validation; in others, they concluded the given bug wasn’t exploitable and searched for entirely new attack surfaces, sometimes by auditing source code or even performing dynamic fuzzing. That’s a remarkable display of autonomous security reasoning.

| Model | Flag | Succ. | Rate |
| --- | --- | --- | --- |
| Claude Opus 4.6 | 36 | 15 | 41.7% |
| Claude Opus 4.7 | 9 | 7 | 77.8% |
| Claude Mythos Preview | 226 | 157 | 69.5% |
| Gemini 3.1 Pro | 18 | 12 | 66.7% |
| GLM-5.1 | 11 | 4 | 36.4% |
| GPT-5.4 | 65 | 54 | 83.1% |
| GPT-5.5 | 210 | 120 | 57.1% |

_Table 3: Flag-to-success rate. Flag counts all instances where the agent captured the flag (any exploitation path); Succ. counts only those where the intended vulnerability was exploited. Rate = Succ. / Flag._
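The Rate column and the unintended-path counts follow directly from the Flag and Succ. columns. A short sketch of that arithmetic, with values copied from Table 3 and names that are our own:

```python
# (flag captures, intended-vulnerability successes, reported rate in %), from Table 3
TABLE3 = {
    "Claude Opus 4.6": (36, 15, 41.7),
    "Claude Opus 4.7": (9, 7, 77.8),
    "Claude Mythos Preview": (226, 157, 69.5),
    "Gemini 3.1 Pro": (18, 12, 66.7),
    "GLM-5.1": (11, 4, 36.4),
    "GPT-5.4": (65, 54, 83.1),
    "GPT-5.5": (210, 120, 57.1),
}


def flag_to_success_rate(flag: int, succ: int) -> float:
    """Rate = Succ. / Flag, as a percentage rounded to one decimal place."""
    return round(100 * succ / flag, 1)


def unintended(flag: int, succ: int) -> int:
    """Flag captures that did not exploit the intended vulnerability."""
    return flag - succ


for model, (flag, succ, reported) in TABLE3.items():
    # The recomputed rate should match the reported column.
    assert flag_to_success_rate(flag, succ) == reported, model
    print(f"{model}: rate {flag_to_success_rate(flag, succ)}%, "
          f"unintended {unintended(flag, succ)}")
```

For the two strongest models this reproduces the 69 (Claude Mythos Preview) and 90 (GPT-5.5) unintended-path captures discussed above.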

Different models find different exploits. Claude Mythos Preview and GPT-5.5 dominate in total count, but their success sets diverge considerably: 56 targets are solved exclusively by Claude Mythos Preview and 26 exclusively by GPT-5.5, with only 91 shared. The remaining models contribute another 61 successes, most of which overlap with the top two, though four are solved by those models alone. This suggests the models rely on qualitatively different exploitation strategies, and that an ensemble approach could substantially expand coverage beyond what any one model achieves.

Image 4: Venn diagram of successes across models

_Figure 2: Overlap of successfully exploited instances across models._

More budget helps, but only for the best models. When we extended the budget from two to six hours, Claude Mythos Preview kept climbing from 127 to 204 successful exploits with no clear plateau. Claude Opus 4.6, by contrast, flatlined at around 15 within the first 30 minutes. This tells us the frontier models are capable of sustained, multi-stage reasoning that can crack harder problems given enough runway. It also means our two-hour budget likely undercounts what the strongest agents can do.

Image 5: Cumulative successes vs. time budget

_Figure 3: Cumulative successful exploits as the per-task time budget grows to six hours._

**Example: From a 5-Line Crash to Full Code Execution in V8**

To make this concrete, here’s one of the impressive trajectories we observed. GPT-5.4 was given a five-line PoV that triggers an assertion in Maglev, V8’s mid-tier JIT compiler, reported by ClusterFuzz in October 2025, after GPT-5.4’s knowledge cutoff. On the release build, the PoV just throws a benign `TypeError` with no visible memory corruption.

From there, the agent independently escalated through a full exploit chain: it identified that the bug depends on receiver shape, crafted an object that tricks Maglev into an out-of-bounds heap read, groomed the heap to leak stable pointers, forged fake V8 string objects to obtain arbitrary native memory reads, leaked libc addresses from the Global Offset Table, and built a signal-return-oriented-programming chain redirecting execution to `system("/challenge/catflag")`. Total time: 71 minutes, 229 lines of exploit code.

An important caveat: this worked because we disabled ASLR and the V8 heap sandbox. With those defenses re-enabled, GPT-5.4 could no longer exploit this specific vulnerability. Modern mitigations remain a meaningful barrier, but an AI agent independently chaining this many primitives on a complex real-world target is a milestone worth noting.

Image 6: GPT-5.4 V8 exploit chain trajectory

_Figure 4: GPT-5.4's V8 exploit chain._

**Why This Matters**

ExploitGym makes concrete what many in the security community have suspected: the gap between “AI can find bugs” and “AI can exploit bugs” is closing fast. This is consistent with findings from our broader analysis of Frontier AI’s Impact on the Cybersecurity Landscape. We frame this as an urgent motivation for two things. First, defenders need to start modeling AI agents as potential attackers when evaluating their security posture. Standard mitigations are still valuable, but they’re no longer sufficient on their own against an adversary that can reason, adapt, and retry at machine speed. Second, responsible AI development needs to account for these capabilities explicitly, through structured access programs, safety filters, and ongoing evaluation.

The benchmark itself is a contribution to both sides of that equation: it gives defenders a way to measure real risk, and it gives AI developers a way to track how their models’ capabilities are evolving in a domain where the stakes are unusually high.

* * *

_The ExploitGym paper is authored by researchers from UC Berkeley, Max Planck Institute for Security and Privacy, UC Santa Barbara, Arizona State University, Anthropic, OpenAI, and Google. The benchmark design and experimental methodology were developed by the academic authors, with industry partners providing model access and feedback. We also thank the GLM team for providing API access._


Copyright © 2025 UC Regents; all rights reserved
