SWE-Bench style grading has been the standard for years now - you ask the agent to solve an issue an...
TL;DR · AI Summary
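FrontierCode is a new coding evaluation benchmark that grades model-generated code along multiple quality dimensions, which substantially reduces false positives (about 80% fewer, per the post) and raises the evaluation bar.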
Key Takeaways
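- FrontierCode grades more broadly than a conventional unit-test check, covering dimensions such as coding style and maintainability.
- The best model, Opus 4.8, scores only 13% on FrontierCode.
- Each FrontierCode task took 40+ hours of work by leading open-source maintainers, raising the bar for both difficulty and quality.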
Outline
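- FrontierCode
  - Evaluation dimensions
    - Code style
    - Maintainability
    - Side effects
  - Model performance
    - Opus 4.8 scores 13%
  - Construction
    - 40+ hours of work per task by open-source maintainers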
Scott Wu
@ScottWu46
SWE-Bench style grading has been the standard for years now - you ask the agent to solve an issue and then run its code on a pre-constructed unit test. The problem is that passing a unit test is only one part of writing production-ready code. You also want to evaluate agents on a number of other axes, including scope, coding style, and unintended side effects. The result is our new benchmark FrontierCode - which has ~80% fewer false positives and for which the best model (Opus 4.8) only scores 13%! "Where others grade like a CI, FrontierCode grades like a tech lead."
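To make the contrast concrete, here is a minimal sketch of test-only grading versus grading on several axes. It is an illustration under stated assumptions, not FrontierCode's actual grader: the axis names (scope, coding style, unintended side effects) come from the post, while the `Submission` class, its boolean fields, the example patches, and both grading functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """A hypothetical agent patch with one verdict per grading axis."""
    name: str
    tests_pass: bool        # the pre-constructed unit test passes
    in_scope: bool          # changes only what the issue asked for
    style_ok: bool          # follows the repository's conventions
    no_side_effects: bool   # no unintended behavior changes


def grade_tests_only(s: Submission) -> bool:
    """CI-style grading: accept anything that passes the unit test."""
    return s.tests_pass


def grade_multi_axis(s: Submission) -> bool:
    """Review-style grading: every axis must hold (an assumed rule)."""
    return s.tests_pass and s.in_scope and s.style_ok and s.no_side_effects


# Invented example data, purely for illustration.
submissions = [
    Submission("clean_fix", True, True, True, True),
    Submission("works_but_sprawling", True, False, True, True),
    Submission("works_but_breaks_api", True, True, True, False),
    Submission("broken", False, True, True, True),
]

accepted_by_tests = [s.name for s in submissions if grade_tests_only(s)]
accepted_by_review = [s.name for s in submissions if grade_multi_axis(s)]

# Patches the test-only grader accepts but the stricter grader rejects.
false_positives = [n for n in accepted_by_tests if n not in accepted_by_review]

print("test-only accepts:", accepted_by_tests)
print("multi-axis accepts:", accepted_by_review)
print("false positives under test-only grading:", false_positives)
```

In this toy data, two of the three patches that pass the unit test would still be rejected on scope or side effects, which is the kind of false positive the post describes.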
Cognition
@cognition
Jun 8
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers. Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
7:54 PM · Jun 8, 2026
81.5K Views