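Another new coding benchmark came out yesterday, DeepSWE: https://t.co/3V65OaHScM The innovation is contamination-free tasks: every task is brand-new and original, written from scratch, not based on existing PR/Commi...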

TL;DR · AI Summary
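DeepSWE is a brand-new coding benchmark that spans multiple languages and real-world complexity; its reference solutions require modifying 668 lines of code on average.
Key Takeaways
- DeepSWE is a brand-new coding benchmark that spans multiple languages and real-world complexity.
- Reference solutions require modifying 668 lines of code on average.
- GPT-5.5 xhigh ranks first on the benchmark, with a pass rate of 90%.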
Outline
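- § Introduction
Introduces DeepSWE's background and features.
Emphasizes the tasks' novelty, diversity, real-world complexity, and the amount of code changed.
- · Test results
Shows how GPT-5.5 xhigh performed on the benchmark.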
Viking on X:
Another new coding benchmark, DeepSWE, is out: https://deepswe.datacurve.ai/blog. Its innovation is contamination-free tasks: every task is completely original and written from scratch rather than based on existing PRs or commits, so it cannot have been memorized from a model's pre-training data. The benchmark also offers diversity (tasks span various languages) and real-world complexity: reference solutions require modifying 668 lines of code on average. In the final ranking, gpt-5.5 xhigh is first and gpt-5.4 xhigh is second. The percentages are pass rates; the tasks do seem quite challenging, since the pass rates for the models further down are very low. You can check the results yourself; they mostly match my expectations. I didn't expect Xiaomi's model to do that well; I haven't used it yet.