The last benchmark for agents? Agents' Last Exam (ALE) evaluates agents on 1,000+ real world profess...

Philipp Schmid (@_philschmid) · June 12, 2026

Score: 8.5

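TL;DR · AI Summary

Agents' Last Exam (ALE) is a benchmark that evaluates agents on more than 1,000 real professional tasks across 55 industries.

Key Points

The best agents pass fewer than 10% of the tasks in the hardest tier.
34% of tasks require GUI software, but agents tend to avoid it and hack together CLI workarounds.
Model choice affects performance more than the choice of harness.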

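Outline

§Introduction
Introduces the goals and characteristics of Agents' Last Exam (ALE).
·ALE tasks and sources
ALE evaluates agents on more than 1,000 real-world professional tasks, all sourced from expert work across 55 industries.
·Key findings
The best agents pass fewer than 10% of the hardest tasks, and execution bugs and format errors account for 22% of failures.
›Failure analysis
47% of failures come from a wrong strategy or giving up early; 31% come from missing domain knowledge.
·Efficiency and performance
The most efficient setup used 160M tokens for a 39.6% pass rate, while the least efficient used 1,373M tokens for a 40.5% pass rate.
›Impact of model vs. harness
Model choice affects performance more than harness choice; with the same model, the gap between the best and worst harness is 4.9pp.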

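Mind Map

Agents' Last Exam (ALE)
- Tasks and sources
  - 1,000+ real tasks
  - 55 industries
- Key findings
  - Best agents pass <10% of the hardest tasks
  - Execution errors are frequent
  - Model choice matters more than the harness
- Failure causes
  - Wrong strategy or gave up early (47%)
  - Missing domain knowledge (31%)
  - Execution bugs and format errors (22%)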

Highlights

The best agents pass fewer than 10% of the tasks in the hardest tier. (paragraph 3)
34% of tasks require GUI software, but agents tend to avoid it and hack together CLI workarounds. (paragraph 5)
Model choice affects performance more than harness choice; the gap between the best and worst harness is 4.9pp. (paragraph 6)

#Agents #Benchmark #AI #Evaluation

The last benchmark for agents? Agents' Last Exam (ALE) evaluates agents on 1,000+ real world professional tasks across 55 industries, all sourced from actual expert work. Not synthetic. Not multiple choice. Real deliverables, graded deterministically.

Key findings:
- Best agents st tier, <10% on the hardest
- 82% on Terminal-Bench drops to 23% on ALE-CLI eval with the same setup
- Hardest tier: most frontier agents hit 0% pass rate
- Spending more tokens doesn't improve results
- Each run tracks harness, model, pass rate, token usage, and cost

Harness vs. model:
- Best harness scores 24.0%, worst scores 19.1% (same model). That's a 4.9pp gap.
- Model choice drives more performance variation than the harness.
- Most efficient setup used 160M tokens for 39.6%. Least efficient burned 1,373M tokens for 40.5%.

Where agents break (Agents often say "Done. All checks pass." while the output is wrong):
- 47% of failures: wrong strategy or gave up early
- 31%: missing domain knowledge
- 22%: execution bugs and format errors
- 34% of tasks need GUI software, agents avoid it and hack CLI workarounds

Very excited to see a benchmark like this. Big kudos to everyone who contributed.

6:36 AM · Jun 12, 2026 3.9K Views
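
To make the efficiency and harness numbers quoted in the post concrete, here is a minimal Python sketch that redoes the arithmetic. The `Run` class, its field names, and the "points per 100M tokens" metric are hypothetical choices for illustration, not part of ALE or its tooling; only the figures themselves come from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical record for one benchmark run (not ALE's actual schema)."""
    label: str
    pass_rate_pct: float    # pass rate in percent, as quoted in the post
    tokens_millions: float  # total tokens used, in millions

    def points_per_100m_tokens(self) -> float:
        # Illustrative efficiency metric: pass-rate points per 100M tokens.
        return self.pass_rate_pct / self.tokens_millions * 100


# Figures quoted in the post.
efficient = Run("most efficient", 39.6, 160)
wasteful = Run("least efficient", 40.5, 1373)

# Best vs. worst harness with the same model, in percentage points.
harness_gap_pp = round(24.0 - 19.1, 1)

# How many more tokens the least efficient setup used, and what it gained.
token_ratio = wasteful.tokens_millions / efficient.tokens_millions
extra_points = round(wasteful.pass_rate_pct - efficient.pass_rate_pct, 1)

# Failure breakdown from the post; the three shares cover all failures.
failure_shares = {
    "wrong strategy or gave up early": 47,
    "missing domain knowledge": 31,
    "execution bugs and format errors": 22,
}

if __name__ == "__main__":
    print(f"harness gap: {harness_gap_pp} pp")
    print(f"token ratio: {token_ratio:.2f}x for +{extra_points} pp")
    for run in (efficient, wasteful):
        print(f"{run.label}: {run.points_per_100m_tokens():.2f} points per 100M tokens")
    print(f"failure shares sum to {sum(failure_shares.values())}%")
```

Under these assumptions, the least efficient setup spends roughly 8.6 times the tokens for 0.9 additional percentage points, which is consistent with the post's claim that spending more tokens doesn't improve results.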
