Philipp Schmid (@_philschmid)

The last benchmark for agents? Agents' Last Exam (ALE) evaluates agents on 1,000+ real world profess...

Score: 8.5

TL;DR · AI summary

Agents' Last Exam (ALE) is a benchmark that evaluates agents on more than 1,000 real professional tasks across 55 industries.

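Key points

  • The best agents pass fewer than 10% of the hardest tasks.
  • 34% of tasks require GUI software, but agents tend to fall back on CLI workarounds.
  • Model choice affects performance more than harness choice.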

Outline

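  1. Introduces the goals and characteristics of Agents' Last Exam (ALE).

  2. ALE evaluates agents on more than 1,000 real-world professional tasks, drawn from expert work across 55 industries.

  3. The best agents pass fewer than 10% of the hardest tasks, and execution error rates are high.

  4. 47% of failures stem from a wrong strategy or giving up early; 31% stem from missing domain knowledge.

  5. The most efficient setup used 160M tokens to reach a 39.6% pass rate, while the least efficient used 1,373M tokens to reach 40.5%.

  6. Model choice affects performance more than harness choice; the gap between the best and worst harness is 4.9pp.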
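
The token figures in item 5 are easier to compare side by side. The following is a minimal sketch that uses only the two data points quoted above; the dictionary keys and the `tokens_per_point` helper are illustrative names, not part of ALE.

```python
# Figures quoted in the summary; key names are illustrative assumptions.
efficient = {"tokens_m": 160, "pass_rate": 39.6}   # most efficient setup
wasteful = {"tokens_m": 1373, "pass_rate": 40.5}   # least efficient setup

# How many times more tokens the least efficient setup consumed.
token_ratio = wasteful["tokens_m"] / efficient["tokens_m"]

# Pass-rate difference in percentage points (rounded to avoid float noise).
pass_rate_gain_pp = round(wasteful["pass_rate"] - efficient["pass_rate"], 1)


def tokens_per_point(setup):
    """Millions of tokens spent per percentage point of pass rate."""
    return setup["tokens_m"] / setup["pass_rate"]


print(f"{token_ratio:.1f}x the tokens for +{pass_rate_gain_pp}pp")
print(
    f"{tokens_per_point(efficient):.1f}M vs "
    f"{tokens_per_point(wasteful):.1f}M tokens per point"
)
```

Under these numbers the least efficient setup spends roughly 8.6 times the tokens for a gain of 0.9pp, which is consistent with the claim that spending more tokens does not improve results.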

Mind map

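  • Agents' Last Exam (ALE)
    • Tasks and sources
      • 1,000+ real tasks
      • 55 industries
    • Key findings
      • Best agents pass <10%
      • High execution error rate
      • Model choice matters more than harness
    • Failure causes
      • Wrong strategy or gave up early (47%)
      • Missing domain knowledge (31%)
      • Execution bugs and format errors (22%)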

#Agents #Benchmark #AI #Evaluation

Philipp Schmid (@_philschmid) on X:

The last benchmark for agents? Agents' Last Exam (ALE) evaluates agents on 1,000+ real world professional tasks across 55 industries, all sourced from actual expert work. Not synthetic. Not multiple choice. Real deliverables, graded deterministically.

Key findings:

  • Best agents st tier, <10% on the hardest
  • 82% on Terminal-Bench drops to 23% on ALE-CLI eval with the same setup
  • Hardest tier: most frontier agents hit 0% pass rate
  • Spending more tokens doesn't improve results
  • Each run tracks harness, model, pass rate, token usage, and cost

Harness vs. model:

  • Best harness scores 24.0%, worst scores 19.1% (same model). That's a 4.9pp gap.
  • Model choice drives more performance variation than the harness.
  • Most efficient setup used 160M tokens for 39.6%. Least efficient burned 1,373M tokens for 40.5%.

Where agents break (Agents often say "Done. All checks pass." while the output is wrong):

  • 47% of failures: wrong strategy or gave up early
  • 31%: missing domain knowledge
  • 22%: execution bugs and format errors
  • 34% of tasks need GUI software, agents avoid it and hack CLI workarounds

Very excited to see a benchmark like this. Big kudos to everyone who contributed.
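
The post says each run tracks harness, model, pass rate, token usage, and cost. One way such a record might be shaped, and how the harness gap for a fixed model could be computed from it, is sketched below. The `RunRecord` class, its field names, and the `model-A` / `harness-best` / `harness-worst` labels are hypothetical placeholders, since the post names neither the model nor the harnesses; only the 24.0% and 19.1% scores come from the post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical per-run record; field names are assumptions, not ALE's schema."""
    harness: str
    model: str
    pass_rate: float                   # percent of tasks passed
    tokens_m: Optional[float] = None   # millions of tokens, if reported
    cost_usd: Optional[float] = None   # cost, if reported


def harness_spread(runs):
    """For each model, best minus worst pass rate across harnesses, in pp."""
    by_model = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run.pass_rate)
    return {
        model: round(max(rates) - min(rates), 1)
        for model, rates in by_model.items()
    }


# Placeholder labels; the two pass rates are the ones quoted in the post.
runs = [
    RunRecord(harness="harness-best", model="model-A", pass_rate=24.0),
    RunRecord(harness="harness-worst", model="model-A", pass_rate=19.1),
]

print(harness_spread(runs))  # {'model-A': 4.9}
```

Grouping by model before taking the spread keeps the comparison to "same model, different harness", which is how the post frames the 4.9pp gap.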

6:36 AM · Jun 12, 2026 · 3.9K Views
