Alex Albert(@alexalbert__)

Claude Mythos Preview Early Snapshot Achieves Over 2x Time Horizon vs. Next Best Model in METR Evaluation


TL;DR · AI Summary

An early snapshot of Claude Mythos Preview has a time horizon more than 2x that of the next best model on METR's 80% success rate benchmark. Separately, METR estimated a 50%-time-horizon of at least 16 hours (95% CI: 8.5–55 hours) on its task suite.

Key Takeaways

  • An early Claude Mythos Preview snapshot has an estimated 50%-time-horizon of at least 16 hours (95% CI: 8.5–55 hours)
  • On METR's 80% success rate benchmark, its time horizon is more than 2x that of the next best model
  • METR ran the evaluation during a limited window in March 2026, on an early version of the model

Outline

  1. METR evaluated an early snapshot of Claude Mythos Preview for risk assessment during a limited window in March 2026.

  2. The comparison against other models uses METR's 80% success rate benchmark.

  3. On that benchmark, the snapshot's time horizon is more than 2x that of the next best model; separately, METR estimated a 50%-time-horizon of at least 16 hours.

  4. The 95% confidence interval spans 8.5 to 55 hours, and the estimate sits at the upper end of what METR can measure without new tasks.

Mindmap

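  • Claude Mythos Preview early evaluation results
    • Time horizon
      • ≥16 hours (50%-time-horizon)
      • More than 2x the next best model
    • Evaluation setup
      • March 2026, limited window
      • Based on the 80% success rate benchmark
    • Confidence interval and limitations
      • 95% CI: 8.5–55 hours
      • Relies on the existing task suite; new tasks are needed to measure beyond it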

Highlights

Key sentences worth saving and sharing.

  • An early Claude Mythos Preview snapshot has a time horizon of more than 2x the next best model on their 80% success rate benchmark.

  • We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite.

  • We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026.

#Claude #Mythos #AI Evaluation #Time Horizon

An early Claude Mythos Preview snapshot we provided METR has a time horizon of more than 2x the next best model on their 80% success rate benchmark

Quote

METR

@METR_Evals

4h

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.
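
The posts above quote two different thresholds: a 50%-time-horizon and an 80% success rate benchmark. As a rough illustration of how the two relate, here is a minimal sketch that assumes success probability follows a logistic curve in the log of task length, and reads off the task length at which predicted success equals a chosen probability. The function name `time_horizon` and the coefficients are hypothetical and chosen only for illustration; they are not METR's fitted values or results for any model.

```python
import math


def time_horizon(p, intercept, slope):
    """Task length (hours) at which predicted success probability equals p.

    Assumes (for illustration) a logistic curve in log2 of task length:
        P(success) = sigmoid(intercept + slope * log2(hours))
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if slope == 0:
        raise ValueError("slope must be non-zero")
    # Invert the sigmoid: intercept + slope * log2(hours) = logit(p)
    logit_p = math.log(p / (1.0 - p))
    return 2.0 ** ((logit_p - intercept) / slope)


# Hypothetical coefficients, NOT taken from any real evaluation.
# A negative slope means longer tasks are less likely to succeed.
INTERCEPT = 3.0
SLOPE = -1.0

h50 = time_horizon(0.5, INTERCEPT, SLOPE)
h80 = time_horizon(0.8, INTERCEPT, SLOPE)
print(f"50%-time-horizon: {h50:.2f} h")
print(f"80%-time-horizon: {h80:.2f} h")
```

Under this assumed curve with a negative slope, the 80% horizon is always shorter than the 50% horizon for the same model, so the "more than 2x" comparison at 80% and the "at least 16hrs" estimate at 50% describe different points on the curve and should not be compared directly.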
