Viking (@vikingmute)

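Yesterday another new coding benchmark came out, DeepSWE: https://t.co/3V65OaHScM. Its innovation is contamination-free tasks: every task is brand new and original, written from scratch, not based on existing PR/Commi...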

TL;DR · AI Summary

DeepSWE is a new coding benchmark that spans multiple languages and real-world complexity; its reference solutions modify 668 lines of code on average.

Key Takeaways

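  • DeepSWE is a new coding benchmark that spans multiple languages and real-world complexity.
  • Reference solutions modify 668 lines of code on average.
  • GPT-5.5 xhigh ranks first in the test, with a pass rate of 90%.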

Outline

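  1. Introduces the background and characteristics of DeepSWE.

  2. Emphasizes the tasks' originality, diversity, real-world complexity, and volume of code changes.

  3. Shows how GPT-5.5 xhigh performed in the test.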

#DeepSWE #CodingBenchmark #GPT-5.5 #Multilingual #RealWorldComplexity
Another new coding benchmark, DeepSWE, is out: https://deepswe.datacurve.ai/blog. Its innovation is contamination-free tasks: every task is original and written from scratch rather than derived from existing PRs or commits, so models cannot have memorized it from their pre-training data. It also emphasizes diversity (tasks span multiple languages) and real-world complexity: reference solutions modify 668 lines of code on average.

In the final ranking, gpt-5.5 xhigh is first and gpt-5.4 xhigh is second. The percentages are pass rates. The tasks do seem quite challenging, since the pass rates of the models further down are very low. You can check the results yourself; they mostly match my expectations. I didn't expect Xiaomi's model to do that well; I haven't used it yet.

[Image: DeepSWE final ranking by pass rate]

2:55 AM · May 28, 2026

