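Another new coding benchmark came out yesterday, DeepSWE: https://t.co/3V65OaHScM

Its innovation is contamination-free tasks: every task is brand new and original, written from scratch, and not based on existing PR/Commi...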

Q: Introduction

Introduces DeepSWE's background and features.

Viking(@vikingmute)

Viking(@vikingmute) May 28, 2026

Score: 8.5

TL;DR · AI Summary

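DeepSWE is a brand-new coding benchmark that covers multiple languages and real-world complexity; its reference solutions modify 668 lines of code on average.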

Key Takeaways

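DeepSWE is a brand-new coding benchmark that covers multiple languages and real-world complexity.
Reference solutions modify 668 lines of code on average.
GPT-5.5 xhigh ranked first in the test, with a pass rate of 90%.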

Outline

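§Introduction
Introduces DeepSWE's background and features.
·DeepSWE features
Highlights the tasks' originality, diversity, real-world complexity, and the amount of code changed.
·Test results
Shows how GPT-5.5 xhigh performed in the test.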

#DeepSWE #CodingBenchmark #GPT-5.5 #Multilingual #RealWorldComplexity

Viking

@vikingmute

Another new coding benchmark, DeepSWE, is out: https://deepswe.datacurve.ai/blog. Its main innovation is contamination-free tasks: every task is original and written from scratch, not based on existing PRs or commits, so models cannot have memorized it from their pre-training data. It also offers diversity (tasks span multiple languages) and real-world complexity: reference solutions change 668 lines of code on average. The final ranking: gpt-5.5 xhigh first, gpt-5.4 xhigh second. The percentage shown is the pass rate. The tasks do seem quite challenging, since the models further down have very low pass rates. You can check the results yourself; they mostly match my expectations. I didn't expect Xiaomi's model to do that well; I haven't used it yet.
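As a rough illustration of what a pass-rate ranking like this measures, here is a minimal Python sketch. It assumes the pass rate is simply tasks passed divided by tasks attempted; the post does not describe DeepSWE's actual scoring harness, so `TaskResult`, `pass_rates`, `rank`, and the demo data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    model: str    # hypothetical model label, not a real leaderboard entry
    task_id: str  # hypothetical task identifier
    passed: bool  # True if the model's patch passed the task's tests


def pass_rates(results: list[TaskResult]) -> dict[str, float]:
    """Return the fraction of attempted tasks that each model passed."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    for r in results:
        totals[r.model] = totals.get(r.model, 0) + 1
        passes[r.model] = passes.get(r.model, 0) + int(r.passed)
    return {m: passes[m] / totals[m] for m in totals}


def rank(rates: dict[str, float]) -> list[tuple[str, float]]:
    """Sort models by pass rate, highest first."""
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Made-up illustration data; these are NOT DeepSWE results.
demo = [
    TaskResult("model-a", "t1", True),
    TaskResult("model-a", "t2", False),
    TaskResult("model-b", "t1", False),
    TaskResult("model-b", "t2", False),
]
leaderboard = rank(pass_rates(demo))
```

Under this assumption, a benchmark with large tasks (hundreds of changed lines per reference solution) tends to produce low pass rates, because each task counts as passed only if the whole change works.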

2:55 AM · May 28, 2026

2,156 Views
