SWE-Bench style grading has been the standard for years now - you ask the agent to solve an issue an...

Scott Wu (@ScottWu46) · June 8, 2026

Score: 8.5

TL;DR · AI Summary

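FrontierCode is a new code evaluation benchmark that judges the quality of model-generated code along multiple dimensions, sharply reducing false positives and raising the bar for evaluation.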

Key Takeaways

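FrontierCode grades more broadly than traditional unit tests, covering dimensions such as code style and maintainability.
The best model, Opus 4.8, scores only 13% on FrontierCode.
Each FrontierCode task took 40+ hours of work by leading open-source maintainers, raising the bar for both difficulty and quality.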

Outline


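§Introduction
Traditional SWE-Bench style grading is limited: it looks only at whether unit tests pass.
·Introducing FrontierCode
FrontierCode is a new code evaluation benchmark that judges the quality of model-generated code along multiple dimensions.
›Evaluation dimensions
FrontierCode evaluates code style, maintainability, side effects, and other dimensions, raising the bar for evaluation.
›Model performance
The best model, Opus 4.8, scores only 13% on FrontierCode.
›Construction
Each FrontierCode task took 40+ hours of work by leading open-source maintainers, raising the bar for both difficulty and quality.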

Mindmap


FrontierCode
- Evaluation dimensions
  - Code style
  - Maintainability
  - Side effects
- Model performance
  - Opus 4.8 scores 13%
- Construction
  - 40+ hours per task by open-source maintainers


#AI #CodeEvaluation #ModelTesting #OpenSource


Scott Wu

@ScottWu46

SWE-Bench style grading has been the standard for years now - you ask the agent to solve an issue and then run its code on a pre-constructed unit test. The problem is that passing a unit test is only one part of writing production-ready code. You also want to evaluate agents on a number of other axes, including scope, coding style, and unintended side effects. The result is our new benchmark FrontierCode - which has ~80% fewer false positives and for which the best model (Opus 4.8) only scores 13%! "Where others grade like a CI, FrontierCode grades like a tech lead."

Cognition

@cognition

Jun 8

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers. Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

7:54 PM · Jun 8, 2026
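
To make the contrast between "grading like a CI" and "grading like a tech lead" concrete, here is a minimal sketch in Python. It is not FrontierCode's actual grader: the `Review` fields, the strict all-axes rule, and the sample data are all assumptions made up for illustration. The axes are taken from the post (scope, coding style, unintended side effects).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """Hypothetical per-submission verdicts; field names are illustrative."""
    tests_pass: bool       # the only thing a CI-style grader checks
    in_scope: bool         # change is limited to what the issue asked for
    style_ok: bool         # follows the project's coding style
    no_side_effects: bool  # no unintended behavior changes elsewhere


def ci_grade(r: Review) -> bool:
    """CI-style grading: passing the unit tests is the only signal."""
    return r.tests_pass


def tech_lead_grade(r: Review) -> bool:
    """Merge-style grading: every axis must hold (an assumed strict AND rule)."""
    return r.tests_pass and r.in_scope and r.style_ok and r.no_side_effects


def false_positives(reviews, grader) -> int:
    """Count submissions the grader accepts but a strict reviewer would not merge."""
    return sum(1 for r in reviews if grader(r) and not tech_lead_grade(r))


# Made-up sample data, not benchmark results.
SAMPLE = [
    Review(True, True, True, True),    # mergeable
    Review(True, False, True, True),   # passes tests but touches unrelated code
    Review(True, True, False, True),   # passes tests but ignores project style
    Review(True, True, True, False),   # passes tests but breaks something else
    Review(False, True, True, True),   # fails tests
]

print("accepted by tests alone:", sum(ci_grade(r) for r in SAMPLE))
print("accepted on all axes:   ", sum(tech_lead_grade(r) for r in SAMPLE))
print("CI-style false positives:", false_positives(SAMPLE, ci_grade))
```

In this toy data, most submissions that pass the tests would still be rejected at review, which is the kind of false positive the post says test-only grading lets through.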
