Claude Mythos Preview Early Snapshot Achieves Over 2x the Time Horizon of the Next Best Model in METR Evaluation

Alex Albert (@alexalbert__), May 8, 2026

TL;DR · AI Summary

An early snapshot of Claude Mythos Preview has a time horizon more than 2x that of the next best model on METR's 80% success rate benchmark. METR estimated its 50%-time-horizon at 16 hours or more (95% CI: 8.5 to 55 hours).

Key Takeaways

METR estimated a 50%-time-horizon of at least 16 hours for Claude Mythos Preview (95% CI: 8.5 to 55 hours)
On METR's 80% success rate benchmark, its time horizon is more than 2x that of the next best model
METR evaluated an early version of the model during a limited window in March 2026

Outline

Evaluation Context and Time Window
METR evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026.
Core Metric: 80% Success Rate Benchmark
The comparison with other models uses METR's 80% success rate benchmark.
Time Horizon Result
METR estimated a 50%-time-horizon of at least 16 hours. On the 80% success rate benchmark, the snapshot's time horizon is more than 2x that of the next best model.
Confidence Interval and Measurement Limits
The 95% confidence interval spans 8.5 to 55 hours, and the estimate sits at the upper end of what METR can measure without new tasks.

Mindmap

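Claude Mythos Preview early evaluation results
- Time horizon
  - At least 16 hours (50% time horizon)
  - More than 2x the next best model (80% success rate benchmark)
- Evaluation setup
  - March 2026, limited window
  - Model comparison based on the 80% success rate benchmark
- Confidence interval and limitations
  - 95% CI: 8.5 to 55 hours
  - At the upper end of what the existing task suite can measure; going further requires new tasks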

Tags: Claude, Mythos, AI Evaluation, Time Horizon

An early Claude Mythos Preview snapshot we provided METR has a time horizon of more than 2x the next best model on their 80% success rate benchmark

Quoting METR (@METR_Evals):

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.