Claude Mythos Preview Early Snapshot Achieves Over 2x Time Horizon vs. Next Best Model in METR Evaluation

TL;DR · AI Summary
An early snapshot of Claude Mythos Preview has a time horizon more than 2x that of the next best model on METR's 80% success rate benchmark. Separately, METR estimated its 50%-time-horizon at no less than 16 hours (95% CI: 8.5–55 hours).
Key Takeaways
- METR estimated a 50%-time-horizon of at least 16 hours for Claude Mythos Preview (95% CI: 8.5–55 hrs)
- At the 80% success rate, its time horizon is more than 2x that of the next best model
- The evaluation was conducted during a limited window in March 2026 on an early version of the model
Outline
METR evaluated an early snapshot of Claude Mythos Preview for risk assessment during a limited window in March 2026.
The comparison with other models uses METR's 80% success rate benchmark.
On that benchmark, the snapshot's time horizon is more than 2x that of the next best model; METR also estimated a 50%-time-horizon of at least 16 hours on its task suite.
The 95% confidence interval spans 8.5 to 55 hours, and the estimate sits at the upper end of what METR can measure without new tasks.
Mindmap
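- Claude Mythos Preview early evaluation results
  - Time horizon
    - At least 16 hours (50%-time-horizon)
    - More than 2x the next best model
  - Evaluation setup
    - March 2026, limited window
    - Based on the 80% success rate benchmark
  - Confidence interval and limitations
    - 95% CI: 8.5–55 hours
    - Relies on the existing task suite; new tasks are needed to measure further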
Highlights
An early Claude Mythos Preview snapshot has a time horizon of more than 2x the next best model on METR's 80% success rate benchmark.
We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite.
The evaluation was conducted during a limited window in March 2026 on an early version of the model.
An early Claude Mythos Preview snapshot we provided METR has a time horizon of more than 2x the next best model on their 80% success rate benchmark
Quote

METR
@METR_Evals
4h
We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.
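To make the two metrics above concrete, here is a minimal sketch of how a 50%-time-horizon and an 80%-time-horizon can relate. It assumes a logistic success curve in log2 of task length, which is how METR has described its time-horizon fits in earlier published work. The function names and the slope value are hypothetical and chosen only for illustration; the 16-hour figure is the lower bound quoted above, and nothing here reproduces METR's actual fit or its 80% result.

```python
import math


def success_probability(task_hours, h50_hours, slope):
    """Predicted success rate on a task of the given length.

    Assumed model: P = sigmoid(slope * (log2(h50) - log2(task_hours))),
    so P is exactly 0.5 when task_hours equals h50_hours.
    """
    x = slope * (math.log2(h50_hours) - math.log2(task_hours))
    return 1.0 / (1.0 + math.exp(-x))


def horizon_at(p, h50_hours, slope):
    """Task length (hours) at which the assumed curve predicts success rate p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if slope <= 0.0:
        raise ValueError("slope must be positive")
    logit = math.log(p / (1.0 - p))
    # Invert the model: log2(t) = log2(h50) - logit / slope
    return h50_hours * 2.0 ** (-logit / slope)


# Hypothetical slope; the real value depends on the fitted data.
ASSUMED_SLOPE = 1.0
h50 = 16.0  # lower-bound estimate quoted in the post
h80 = horizon_at(0.8, h50, ASSUMED_SLOPE)
print(f"50% horizon: {h50:.1f} h, 80% horizon under assumed slope: {h80:.1f} h")
```

Under any positive slope, the 80% horizon is shorter than the 50% horizon, which is why the two figures in the posts above are not directly comparable.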