Concept

SWE-Bench

Q: What is SWE-Bench?

A benchmark suite for evaluating the performance of coding models.
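As a rough illustration of what test-based, SWE-Bench-style grading involves, the sketch below marks a task as resolved only when the tests that the fix should turn green pass and the previously passing tests still pass. The class, function, and test names are hypothetical and chosen for illustration; this is not the benchmark's actual harness.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """Hypothetical container for one benchmark task (names are illustrative)."""
    instance_id: str
    fail_to_pass: list = field(default_factory=list)  # tests the fix must make pass
    pass_to_pass: list = field(default_factory=list)  # tests that must keep passing


def is_resolved(task, outcomes):
    """A task counts as resolved only if every required test passed.

    `outcomes` maps test name -> True/False; a missing test counts as a failure.
    """
    required = task.fail_to_pass + task.pass_to_pass
    return all(outcomes.get(name, False) for name in required)


def resolve_rate(flags):
    """Fraction of tasks resolved; 0.0 for an empty run."""
    return sum(flags) / len(flags) if flags else 0.0


# Made-up example data, not taken from any real benchmark run.
task = TaskInstance("demo-1", fail_to_pass=["test_bug_fixed"], pass_to_pass=["test_existing"])
good_patch = {"test_bug_fixed": True, "test_existing": True}
regressing_patch = {"test_bug_fixed": True, "test_existing": False}

flags = [is_resolved(task, good_patch), is_resolved(task, regressing_patch)]
print(flags, resolve_rate(flags))
```

A pass/fail verdict of this kind says nothing about style or maintainability, which is the gap that several of the items collected on this page discuss.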

Q: What is new with SWE-Bench?

traeai has indexed 5 items related to SWE-Bench. The most recent is "When AI Builds Itself: Our progress toward recursive self-improvement", published via Hacker News Best.

5 highly relevant items tracked

TraeAI Watch

If you read only three

When AI Builds Itself: Our progress toward recursive self-improvement

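Hacker News Best · 9.2 points

Recursive self-improvement of AI is arriving faster: Anthropic's internal data shows an 8x increase in engineers' code output, the length of tasks that models can complete reliably is doubling every 4 months, and week-long tasks are projected to be within reach by 2027.

Cohere releases its first open-source coding model, "North Mini Code": small parameter count, high efficiency, built for agentic coding. Parameters: MoE architecture (30B, 3B), 128 experts, 8 activated per token. Context: ...

meng shao (@shao__meng) · 8.5 points

Cohere has released North Mini Code, an open-source coding model with an MoE architecture, optimized for agentic coding, with performance close to that of large models.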

SWE-Bench style grading has been the standard for years now - you ask the agent to solve an issue an...

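Scott Wu (@ScottWu46) · 8.5 points

FrontierCode is a new code evaluation benchmark that assesses the quality of model-generated code along multiple dimensions, substantially reducing misjudgments and raising the evaluation bar.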

When AI Builds Itself: Our Progress Toward Recursive Self-Improvement

Hacker News Best · June 5 · 5,602 words (about 23 min)

Recursive self-improvement is accelerating; Anthropic data shows an 8x increase in engineer code output and AI reliable task duration doubling every 4 months, projecting week-long task capability by 2027.

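Why it was selected: Anthropic engineers' quarterly code output is up 8x over the 2021-2025 average; AI is already materially accelerating R&D.

Featured · Article · #Recursive Self-Improvement #Anthropic #AI Agents #SWE-bench #METR · English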
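The doubling claim in the entry above can be turned into simple arithmetic. The sketch below assumes clean exponential growth with the quoted 4-month doubling period; the starting horizon is a hypothetical placeholder, since the summary does not state one, and the function names are illustrative.

```python
import math

DOUBLING_MONTHS = 4.0  # doubling period quoted in the summary above


def horizon_after(start_hours, months, doubling_months=DOUBLING_MONTHS):
    """Task horizon after `months`, assuming steady exponential growth."""
    return start_hours * 2.0 ** (months / doubling_months)


def months_to_reach(start_hours, target_hours, doubling_months=DOUBLING_MONTHS):
    """Months needed for the horizon to grow from start_hours to target_hours."""
    return doubling_months * math.log2(target_hours / start_hours)


# Hypothetical baseline, chosen only to keep the arithmetic easy to follow.
START_HOURS = 1.0

print(horizon_after(START_HOURS, 8))       # two doublings in 8 months
print(months_to_reach(START_HOURS, 16.0))  # four doublings are needed for 16x
```

Under this assumption every 16x increase in task length takes 16 months, so any projected date depends heavily on the baseline chosen.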

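Cohere releases its first open-source coding model, "North Mini Code": small parameter count, high efficiency, built for agentic coding. Parameters: MoE architecture (30B, 3B), 128 experts, 8 activated per token. Context: ...

meng shao (@shao__meng) · June 10 · 793 words (about 4 min)

Cohere has released North Mini Code, an open-source coding model with an MoE architecture, optimized for agentic coding, with performance close to that of large models.

Why it was selected: North Mini Code uses an MoE architecture with parameter sizes listed as 30B and 3B, activating 8 experts per token.

Featured · Tweet · #Cohere #Open-Source Models #Coding Models #Agent #MoE · Mixed Chinese and English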
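To make "128 experts, 8 activated per token" concrete, here is a minimal sketch of generic top-k expert routing in a mixture-of-experts layer. It is not Cohere's implementation: the routing scheme, the random router scores, and all names are assumptions for illustration, and only the two counts come from the summary above.

```python
import math
import random

NUM_EXPERTS = 128  # expert count from the summary above
TOP_K = 8          # experts activated per token, from the summary above


def route_token(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Stand-in router scores for a single token (random, purely illustrative).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route_token(logits)
active_fraction = TOP_K / NUM_EXPERTS
print(len(chosen), active_fraction)
```

With these counts only 8 of 128 experts (6.25%) run for any given token, which is how a sparse model can keep per-token compute well below what its total parameter count suggests.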

SWE-Bench style grading has been the standard for years now - you ask the agent to solve an issue an...

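Scott Wu (@ScottWu46) · June 10 · 239 words (about 1 min)

FrontierCode is a new code evaluation benchmark that assesses the quality of model-generated code along multiple dimensions, substantially reducing misjudgments and raising the evaluation bar.

Why it was selected: FrontierCode's evaluation criteria are more comprehensive than traditional unit tests, covering dimensions such as code style and maintainability.

Featured · Tweet · #AI #Code Evaluation #Model Testing #Open Source · English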

Can LLMs Generate Enterprise Quality Code? — Prasenjit Sarkar, Sonar

AI Engineer · June 1 · 3,517 words (about 15 min)

While LLMs achieve high functional pass rates (e.g., Gemini 3.1 Pro at 84.17%), Sonar’s evaluation of 4,444 Java tasks reveals critical maintainability and security flaws—614 bugs per million lines, verbose code, and high cyclomatic complexity.

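Why it was selected: Gemini 3.1 Pro reaches an 84.17% functional pass rate on SWE Bench tests, but the code it generates is verbose (307,000 lines) and highly complex (cyclomatic complexity of 234).

Featured · Video · #LLM #Code Quality #Sonar #Enterprise Development · English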

Import AI 455: AI systems are about to start building themselves.

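Import AI · May 9 · 2,928 words (about 12 min)

AI systems are about to start building themselves; AI R&D with no human involvement may be achieved by 2028.

Why it was selected: AI R&D with no human involvement may arrive before 2028, with a probability above 60%.

Featured · Article · #AI #Automation #R&D · Chinese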
