What's new with GPT-5.5?

TraeAI has indexed 30 items related to GPT-5.5. The most recent is "OpenAI's GPT-5.5 and Codex Reach General Availability on Amazon Bedrock", published by InfoQ.

Model

GPT-5.5

Q: What is GPT-5.5?

A frontier language model released by OpenAI.

Alias: GPT5.5

30 highly relevant items tracked

TraeAI Observations

If you read only three

OpenAI's GPT-5.5 and Codex Reach General Availability on Amazon Bedrock

InfoQ · Score: 8.5

OpenAI's GPT-5.5 and Codex are now available through Amazon Bedrock, with support for enterprise-grade governance and compliance.

Day 0 Anthropic Fable 5 in ParseBench: We tested the model's advancements when it comes to document ...

LlamaIndex 🦙 (@llama_index) · Score: 8.5

Anthropic Fable 5 performs strongly on document understanding tasks, reaching 90.02% content faithfulness, significantly ahead of Gemini 3 Flash and GPT-5.5.

Claude Opus 4.8 debuts on Agent Arena tied #1 with GPT 5.5 (High) for Thinking & ranked #8 for Non-T...

lmarena.ai (@lmarena_ai) · Score: 8.5

Claude Opus 4.8 ties for first with GPT 5.5 (High) on Agent Arena in the Thinking category, but ranks eighth in the non-thinking category.

OpenAI's GPT-5.5 and Codex Reach General Availability on Amazon Bedrock

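InfoQ · June 11 · 1,060 words (about 5 min)

OpenAI's GPT-5.5 and Codex are now available through Amazon Bedrock, with support for enterprise-grade governance and compliance.

Why it was selected: GPT-5.5 and Codex can now be used through Amazon Bedrock without onboarding a new vendor.

Featured Article · Tags: OpenAI, Amazon Bedrock, AI, Cloud Services · Language: English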

Claude Opus 4.8 debuts on Agent Arena tied #1 with GPT 5.5 (High) for Thinking & ranked #8 for Non-T...

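lmarena.ai (@lmarena_ai) · June 10 · 267 words (about 2 min)

Claude Opus 4.8 ties for first with GPT 5.5 (High) on Agent Arena in the Thinking category, but ranks eighth in the non-thinking category.

Why it was selected: Claude Opus 4.8 outperforms version 4.7 when thinking mode is enabled.

Featured Tweet · Tags: Claude, GPT, Agent Arena, Model Evaluation · Language: English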

Day 0 Anthropic Fable 5 in ParseBench: We tested the model's advancements when it comes to document ...

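LlamaIndex 🦙 (@llama_index) · June 10 · 185 words (about 1 min)

Anthropic Fable 5 performs strongly on document understanding tasks, reaching 90.02% content faithfulness, significantly ahead of Gemini 3 Flash and GPT-5.5.

Why it was selected: Anthropic Fable 5 reaches 90.02% on the content faithfulness metric, ahead of Gemini 3 Flash and GPT-5.5.

Featured Tweet · Tags: Anthropic, Models, Document Understanding, AI · Language: English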

Introducing new capabilities to GPT-Rosalind

OpenAI Blog · June 4 · 2,278 words (about 10 min)

OpenAI introduces a new model update to GPT-Rosalind, designed for life sciences research at enterprise scale. The updated model combines GPT-5.5's agentic coding and tool-use capabilities with stronger model intelligence in core drug-discovery domains such as medicinal chemistry and genomics. GPT-Rosalind shows broad performance gains on research tasks from biology experts, complex medicinal chemistry queries, quantitative biology, and wet lab troubleshooting.

Why it was selected: GPT-Rosalind combines GPT-5.5's agentic coding and tool-use capabilities with stronger model intelligence in core drug-discovery domains.

Featured Article · Tags: GPT-Rosalind, life sciences, research, performance improvement, model update · Language: English

Hands-on Test of MiniMax M3: 74 Logos on Huang’s PPT, I Thought It Would Be Hard for It

量子位 · June 2 · 3,891 words (about 16 min)

MiniMax M3 is China's first open-source model with simultaneous long-context, multimodal, and coding capabilities; it scored 59% on SWE-Bench Pro, outperforming GPT-5.5 and Gemini 3.1 Pro, with efficiency boosted to 1/20 of the previous generation.

Why it was selected: M3 scored 59% on SWE-Bench Pro, surpassing GPT-5.5 and Gemini 3.1 Pro.

Featured Article · Tags: MiniMax, Open Source Model, Multimodal, Coding Capability, AI Evaluation · Language: Chinese

OpenAI Models and Codex on Amazon Bedrock Are Now Generally Available

AWS Machine Learning Blog · June 1 · 965 words (about 4 min)

OpenAI’s GPT-5.5, GPT-5.4, and Codex are now generally available on Amazon Bedrock for production deployment, matching OpenAI’s pricing and inheriting AWS security & governance frameworks.

Why it was selected: On Bedrock, GPT-5.5 has the same per-token pricing as calling OpenAI directly, with no additional fees.

Featured Article · Tags: OpenAI, Amazon Bedrock, GPT-5.5, Codex, AI Inference · Language: English

Finally a Good Benchmark (Deep Suite)

Matthew Berman · May 28 · 3,734 words (about 15 min)

Deep Suite is a software engineering benchmark designed to provide more accurate model evaluations than existing public benchmarks. It offers four major advantages: contamination-free tasks, high diversity, real-world complexity, and reliable verification. According to Deep Suite's testing, GPT 5.5 outperforms Opus 4.7.

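Why it was selected: Deep Suite uses hand-written tasks to avoid the problem of models having seen the solutions during pretraining.

Featured Video · Tags: AI, Machine Learning, Deep Learning, Natural Language Processing, Software Engineering · Language: Chinese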

I think Anthropic and OpenAI have found product-market fit

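Hacker News Best · May 28 · 1,867 words (about 8 min)

The article argues that Anthropic and OpenAI have found product-market fit, locking in enterprise customers by raising API prices.

Why it was selected: Anthropic and OpenAI have both raised API prices, locking in enterprise customers.

Featured Article · Tags: Anthropic, OpenAI, API Pricing, Enterprise Customers, Product-Market Fit · Language: English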

https://t.co/o6CEQEW0V4

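向阳乔木 (@vista8) · May 28 · 2,575 words (about 11 min)

Dan Shipper, CEO of Every, shares how AI tools are used in real work, points out that stronger AI tends to make people busier rather than less busy, and predicts that ways of working will move toward company-level systems and work operating systems.

Why it was selected: AI tools have shortcomings in real work: they cannot proactively spot problems and reframe them.

Featured Tweet · Tags: AI, Every, Dan Shipper, Changes in Ways of Working, SaaS · Language: Chinese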

Underappreciated how capable GPT-5.5 is at cybersecurity:

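Greg Brockman (@gdb) · May 28 · 94 words (about 1 min)

GPT-5.5's cybersecurity capability is underappreciated: it found a 27-year-old remote code execution vulnerability.

Why it was selected: GPT-5.5 found a 27-year-old RCE vulnerability that was introduced in 1999.

Featured Tweet · Tags: GPT-5.5, Cybersecurity, RCE Vulnerability, Artificial Intelligence · Language: English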

Warp’s big bet on building open source with GPT-5.5

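OpenAI Blog · May 28 · 884 words (about 4 min)

Warp uses GPT-5.5 to drive open-source software development through its Open Agentic Development model, in which humans define goals and AI agents carry out the tasks, improving development efficiency and code quality.

Why it was selected: Warp introduces the Open Agentic Development model, in which AI agents help write code, improving development efficiency.

Featured Article · Tags: Warp, GPT-5.5, Open Agentic Development, Oz, Open-Source Software Development · Language: English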

I think Anthropic and OpenAI have found product-market fit

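Simon Willison's Weblog · May 28 · 1,867 words (about 8 min)

By adjusting their pricing strategies, Anthropic and OpenAI signal that they have found product-market fit: enterprise customers now pay API prices rather than the earlier discounted prices.

Why it was selected: Anthropic and OpenAI moved enterprise customers from discounted pricing to API prices.

Featured Article · Tags: Anthropic, OpenAI, Product-Market Fit, Pricing Strategy, Enterprise Customers · Language: Chinese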

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

Hugging Face Blog · May 27 · 861 words (about 4 min)

ITBench-AA is a new benchmark series evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering (SRE) tasks, on which frontier models score below 50%. The SRE tasks measure model performance on Kubernetes incident response, where models and agents must diagnose live systems by reading logs, tracing dependencies, and identifying root-cause entities across complex infrastructure.

Why it was selected: Claude Opus 4.7 performs best on ITBench-AA, scoring 47%.

Featured Article · Tags: ITBench-AA, Site Reliability Engineering, Frontier Models, IBM, Kubernetes · Language: Chinese

The Latest Codex Updates and The Truth about Opus 4.8

Riley Brown · June 1 · 6,488 words (about 26 min)

Anthropic released Claude Opus 4.8, but experts like Greg Eisenberg and Matt Wolf argue it’s nearly indistinguishable from 4.7, signaling a shift to iPhone-style incremental upgrades; Deep Suite data shows GPT 5.5 outperforms Opus 4.8 in coding tasks at lower cost and token usage, while OpenAI’s Codex saw undisclosed but impactful updates.

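Why it was selected: Comparing Opus 4.8 with 4.7, neither the author nor several experts could tell the performance difference, which reflects model evolution entering an "iPhone-style" incremental phase.

Featured Video · Tags: AI Models, Claude, GPT-5.5, Codex, SWEBench · Language: English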

Open source is going to win

Paul Couvert (@itsPaulAi) · June 2 · 203 words (about 1 min)

The open-weight model MiniMax M3 has reached performance comparable to GPT-5.5 and Opus 4.7, outperforming Gemini 3.1 Pro in coding tasks, and costs 10x less to use, with weights to be released on Hugging Face next week.

Why it was selected: MiniMax M3 performs on par with GPT-5.5 on SWE Bench Pro.

Featured Tweet · Tags: Open Source, AI Model, MiniMax M3, GPT-5.5, Gemini · Language: English

OpenAI + Amazon Bedrock

Greg Brockman (@gdb) · June 2 · 74 words (about 1 min)

OpenAI's GPT-5.5, GPT-5.4, and Codex models are now generally available on Amazon Bedrock, supporting auto-scaling and next-gen inference engine for building multi-step autonomous agents.

Why it was selected: GPT-5.5, GPT-5.4, and Codex are now generally available on Amazon Bedrock, with support for auto-scaling.

Featured Tweet · Tags: OpenAI, Amazon Bedrock, GPT-5.5, AI Models, Cloud Services · Language: English

$10K Cursor Credits Expired, Miss It So Much 😄

I used Cursor freely in May and spent about $2K. Here is a rough summary of my Cursor experience:
· 100% of my time was spent in Agent Window...

meng shao (@shao__meng) · June 2 · 400 words (about 2 min)

After the $10K Cursor credit expired, users reported that Agent Window mode almost completely replaced traditional IDEs; GPT-5.5 and Composer 2.5 performed well in different scenarios, especially Composer 2.5 Fast mode which is fast and good at generating flowcharts, but default output is not Markdown and cannot be copied directly as Markdown, affecting efficiency.

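Why it was selected: The user spends 100% of the time in Cursor's Agent Window and never opens the traditional IDE interface.

Featured Tweet · Tags: Cursor, AI Editor, Agent Window, GPT-5.5, Composer 2.5 · Language: Mixed Chinese and English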

Lovable on How GPT-5.5 Unlocks Better Planning for Complex Builds

OpenAI · June 1 · 260 words (about 2 min)

GPT-5.5 significantly improves planning for complex builds: 31% better intent understanding, 22% fewer memory lapses, enabling non-coders to focus on goals, not code.

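Why it was selected: GPT-5.5 improves intent understanding in the planning stage by 31%, reducing the need for repeated interaction.

Featured Video · Tags: GPT-5.5, AI Planning, Lovable, No-code Development · Language: English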

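Using a Coding Agent well comes down to the two ends, especially the beginning: if it goes off track at the start, no amount of later revision will fix it.

AI HOT 精选 · May 28 · 722 words (about 3 min)

When developing a new feature with a Coding Agent, the key is the planning stage: have multiple models generate plans and pick the best one so that the rest of the development goes smoothly.

Why it was selected: Organize the requirements before developing a new feature, and use multiple Agents to generate plans.

Featured Article · Tags: Coding Agent, Development Workflow, AI Models · Language: Chinese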

Major upgrade to GPT-Rosalind, with much better intelligence for drug discovery, analysis, design, a...

Major GPT-Rosalind Upgrade: Enhanced Agentic Intelligence for Drug Discovery

Greg Brockman (@gdb) · June 5 · 104 words (about 1 min)

GPT-Rosalind's major upgrade integrates GPT-5.5's agentic coding and tool-use capabilities, significantly boosting enterprise-grade AI efficacy in drug discovery, analysis, and experimental workflows.

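Why it was selected: GPT-Rosalind integrates GPT-5.5's agentic coding capability, supporting automated code generation and debugging for drug R&D.

Featured Tweet · Tags: GPT-Rosalind, AI Drug Discovery, GPT-5.5, Agentic Coding · Language: English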

I currently run Hermes Studio on my NAS and expose it through FRP for NAT traversal, so I can use it anytime from a browser or my phone. On my computer I also installed the official Hermes Desktop to use as an Agent. The default...

Geek (@geekbb) · June 8 · 336 words (about 2 min)

Deploying Hermes Studio on a NAS with FRP for NAT traversal, and using multiple AI models to improve work efficiency.

Why it was selected: Deploying Hermes Studio on a NAS enables remote access.

Featured Tweet · Tags: NAS, FRP, Hermes Studio, AI Models · Language: Chinese

OpenAI Recruits Tsinghua's 'Young Talent'! 12-Year-Old University Student, Harvard’s Youngest Tenured Professor

量子位 · June 2 · 1,968 words (about 8 min)

Yin Xi joins OpenAI on sabbatical to advance AI-theoretical physics research, claiming AI can replicate human intelligence limits and accelerate science by 100x.

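Why it was selected: Yin Xi entered the Special Class for the Gifted Young at the University of Science and Technology of China at 12, became Harvard's youngest Chinese full professor at 31, and is now joining OpenAI on sabbatical.

Featured Article · Tags: OpenAI, AI for Science, Theoretical Physics · Language: Chinese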

TL;DR: Fable 5 isn’t the right-sized model for every task, but when quality and depth matter (revie...

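Augment Code (@augmentcode) · June 10 · 105 words (about 1 min)

Fable 5 excels at specific tasks but is not suited to every scenario.

Why it was selected: Fable 5 stands out on tasks that demand high quality and depth.

Featured Tweet · Tags: Fable 5, Models, AI, GPT, Opus 4.8 · Language: English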

SWEbench is Done.

Matthew Berman · June 2 · 212 words (about 1 min)

The article questions the credibility of the SWEbench benchmark, noting that GPT-5.5 significantly outperforms Claude Opus 4.7 in DeepSuite (70% vs 54%), but SWEbench results show the opposite, suggesting the benchmark may be invalid.

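Why it was selected: SWEbench results are being questioned: GPT-5.5 scores 70% on DeepSuite, significantly higher than Claude Opus 4.7's 54%.

Featured Video · Tags: SWEbench, DeepSuite, GPT-5.5, Claude Opus, AI Evaluation · Language: English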

[AINews] Founders and Forward Deployed Engineers

Latent Space · June 1 · 1,866 words (about 8 min)

Anthropic released Claude Opus 4.8, showing incremental but not dominant gains across benchmarks—especially regressing on document parsing fidelity. Platform updates like mid-conversation system instructions improve engineering usability, yet API pricing remains a major pain point. Hugging Face also exposed a subtle RL training bug where re-tokenization breaks gradient flow in multi-turn tool-use loops.

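Why it was selected: Claude Opus 4.8 is more efficient on CursorBench, but improves only slightly over 4.7 and regresses on content faithfulness and chart parsing.

Featured Article · Tags: Anthropic, RL, Agent, API, Benchmark · Language: English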

Recently, DeepSeek-V4 Pro feels really good—especially because it’s cheap!

Viking (@vikingmute) · May 31 · 174 words (about 1 min)

DeepSeek-V4 Pro is praised for cost-effectiveness in small tasks like code review and writing, replacing expensive Qwen-Max; current primary model ranking: GPT-5.5 > Claude 4.7 > DeepSeek-V4 Pro.

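Why it was selected: DeepSeek-V4 Pro performs well on small tasks (such as review and writing) and is priced significantly lower than Qwen-Max.

Featured Tweet · Tags: DeepSeek, Qwen, LLM Selection, Cost Optimization · Language: Mixed Chinese and English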

DeepSWE's Score on Opus 4.8 Is Out: Stronger Than 4.7, Lower Cost, Higher Efficiency, but Still Far Behind GPT-5.5. I Haven't Used It Deeply Yet. I'm Still Using 4.6 Just Because It's Cheaper.

Also, I now...

Viking (@vikingmute) · June 1 · 366 words (about 2 min)

DeepSWE’s evaluation shows Opus 4.8 outperforms 4.7 in performance, cost, and efficiency, yet still lags far behind GPT-5.5; the author continues using cheaper 4.6 without deep testing of 4.8 or 5.5, and expresses skepticism toward benchmarks, preferring real user feedback from social media.

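Why it was selected: Opus 4.8 outperforms 4.7 while offering lower inference cost and higher efficiency, but does not reach GPT-5.5's level.

Featured Tweet · Tags: Large Language Model, Benchmark, Opus, GPT-5.5, Cost-Efficiency · Language: Chinese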

SWEbench is done.

Matthew Berman · June 2 · 212 words (about 1 min)

The video argues that the SWEbench benchmark is unreliable: GPT 5.5 scores 70% on Deep Suite versus 54% for Opus 4.7, while SWEbench results show the opposite ordering.

Why it was selected: GPT 5.5 achieves 70% accuracy on Deep Suite, significantly outperforming Opus 4.7 at 54%.

Featured Video · Tags: SWEbench, Deep Suite, GPT, Opus, Gemini · Language: English

Codex Windows App Launches Computer Use | Copilot Switches to Token Billing, GPT-5.5 Price Increase

夕小瑶科技说 · June 2 · 73 words (about 1 min)

Codex Windows app launches Computer Use feature, Copilot switches to token billing, GPT-5.5 price increase.

Why it was selected: The Codex Windows app launches the Computer Use feature.

Featured Article · Tags: Codex, Copilot, GPT · Language: Chinese

11 is an even row window according to GPT 5.5 thinking.

Suhail (@Suhail) · May 31 · 50 words (about 1 min)

GPT-5.5 incorrectly classifies the number 11 as an 'even row window', revealing severe flaws in basic math and terminology understanding.

Why it was selected: GPT-5.5 is reported to have misjudged 11 as an 'even row window', which is in fact a semantic confusion over terms such as 'even' and 'row/window'.

Featured Tweet · Tags: AI Hallucination, LLM, Math Literacy · Language: English

Cross-source Q&A · GPT-5.5

Answers are based on: 30 items related to GPT-5.5