Scott Wu(@ScottWu46)

SWE-Bench style grading has been the standard for years now - you ask the agent to solve an issue an...

Score: 8.5

TL;DR · AI Summary

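FrontierCode is a new coding benchmark that grades model-generated code along multiple dimensions, yielding roughly 80% fewer false positives and raising the bar for evaluation.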

Key Takeaways

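  • FrontierCode grades more broadly than traditional unit-test checks, covering dimensions such as scope, coding style, and maintainability.
  • The best model, Opus 4.8, scores only 13% on FrontierCode, far lower than under traditional evaluation methods.
  • Each FrontierCode task took 40+ hours of work by leading open-source maintainers, raising both task difficulty and quality requirements.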

Outline

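  1. Traditional SWE-Bench style grading is limited: it only checks whether the unit tests pass.

  2. Introducing FrontierCode

    FrontierCode is a new coding benchmark that grades the quality of model-generated code along multiple dimensions.

  3. FrontierCode evaluates coding style, maintainability, side effects, and other dimensions, raising the evaluation bar.

  4. The best model, Opus 4.8, scores only 13% on FrontierCode, far lower than under traditional evaluation methods.

  5. Each FrontierCode task took 40+ hours of work by leading open-source maintainers, raising both task difficulty and quality requirements.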

Mindmap

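  • FrontierCode
    • Evaluation dimensions
      • Coding style
      • Maintainability
      • Side effects
    • Model performance
      • Opus 4.8 scores 13%
    • Construction
      • 40+ hours of work per task by open-source maintainers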

Highlights

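  • FrontierCode is a new coding benchmark that grades model-generated code along multiple dimensions, yielding roughly 80% fewer false positives and raising the bar for evaluation.

    Introduction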
  • The best model, Opus 4.8, scores only 13% on FrontierCode, far lower than under traditional evaluation methods.

    Model performance
  • Each FrontierCode task took 40+ hours of work by leading open-source maintainers, raising both task difficulty and quality requirements.

    Construction
#AI #CodeEvaluation #ModelTesting #OpenSource

Scott Wu on X: "SWE-Bench style grading has been the standard for years now - you ask the agent to solve an issue and then run its code on a pre-constructed unit test. The problem is that passing a unit test is only one part of writing production-ready code. You also want to evaluate agents on" / X

Scott Wu

@ScottWu46

SWE-Bench style grading has been the standard for years now - you ask the agent to solve an issue and then run its code on a pre-constructed unit test. The problem is that passing a unit test is only one part of writing production-ready code. You also want to evaluate agents on a number of other axes, including scope, coding style, and unintended side effects. The result is our new benchmark FrontierCode - which has ~80% fewer false positives and for which the best model (Opus 4.8) only scores 13%! "Where others grade like a CI, FrontierCode grades like a tech lead."

Cognition

@cognition

Jun 8

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers. Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

7:54 PM · Jun 8, 2026
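
To make the contrast in the post concrete, here is a minimal Python sketch of the two grading styles it describes. It is not FrontierCode's actual rubric or implementation, which the post does not specify. The `Submission` fields and both grading functions are hypothetical names, and requiring every axis to pass is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """Hypothetical record of how one agent patch was judged on each axis."""
    tests_pass: bool       # outcome of the pre-constructed unit test
    in_scope: bool         # change stays within what the issue asked for
    style_ok: bool         # follows the project's coding style
    no_side_effects: bool  # introduces no unintended side effects


def ci_grade(s: Submission) -> bool:
    """SWE-Bench style grading as described in the post: only the test matters."""
    return s.tests_pass


def tech_lead_grade(s: Submission) -> bool:
    """Assumed multi-axis grading: the patch must hold up on every axis."""
    return s.tests_pass and s.in_scope and s.style_ok and s.no_side_effects


# A patch that passes the test but touches code outside the issue's scope.
sloppy = Submission(tests_pass=True, in_scope=False, style_ok=True, no_side_effects=True)
# A patch that is acceptable on every axis.
clean = Submission(tests_pass=True, in_scope=True, style_ok=True, no_side_effects=True)
```

Under these assumptions, `sloppy` is accepted by `ci_grade` but rejected by `tech_lead_grade`, which is the kind of false positive the post says test-only grading lets through.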
