GPT-5.5 + Codex! It's exploding — the ultimate combo.
Hi, I'm Guide. Let me break down the numbers first: ChatGPT Plus costs $20/month and now directly gives you access to GPT-5.5. What about Claude Opus 4.6/4.7? The Pro plan is also $20/month.
At the same price, GPT-5.5's quota lasts longer, and its capabilities are on par or even stronger in some scenarios.
In my previous model evaluations, I said: GPT-5.5 and Claude Opus 4.6 are tied as top-tier models—no clear #1. Both occasionally fail, just in different ways. But when adding "cost-effectiveness" into the mix, GPT-5.5’s advantage becomes obvious—it’s the most stable for engineering tasks, has the broadest API ecosystem, and handles large-scale work efficiently.
GPT-5.5 was released on April 23, 2026, and users with Plus, Pro, Business, or Enterprise plans could immediately use it in both ChatGPT and Codex. Codex even offers a 400K context window—available to Plus users (subscription method at the end of this article, not an ad).
I started using it right after launch and have since tackled several real-world engineering problems with it. Honestly, it’s *absolutely* amazing—truly game-changing!
This article is close to a practical case study, featuring three real project examples. By reading this, you’ll understand:
- The actual capability level of GPT-5.5: Using benchmark data to show where it stands.
- How the “expensive model designs, cheap model executes” approach works in practice: Two cases validate this strategy—GPT-5.5 drafts solutions, V4-Pro implements them; then V4-Pro scans issues, GPT-5.5 fixes them.
- How to design a multi-model configuration center more reasonably: DB as the source of truth, YAML only as startup seed, API keys stored encrypted.
- Why RAG must separate Chat Provider from Embedding Provider, plus real-world vector dimension pitfalls.
- How to best pair GPT-5.5 + Codex: Action-first mindset, context collection, AGENTS.md, and other practical methods.
- How to actually calculate cost-effectiveness: $20/month vs $200/month—what’s the real difference?
What’s the actual performance level of GPT-5.5?
First, let’s look at the data. OpenAI published a set of benchmark comparisons upon release—I picked a few most relevant to engineering scenarios:
| Metric | GPT-5.4 | GPT-5.5 | Improvement |
| --- | --- | --- | --- |
| Terminal-Bench 2.0 | 75.1% | 82.7% | +7.6 percentage points |
| SWE-Bench Pro | 57.7% | 58.6% | +0.9 percentage points |
| MRCR v2 (512K–1M tokens) | 36.6% | 74.0% | +37.4 percentage points |
| Hallucination Rate | Baseline | Reduced by 60% | Compared to GPT-5.4 |
Key takeaways:
- Massive improvement in long-context reasoning: MRCR v2 jumped from 36.6% to 74.0%, roughly doubling. This means GPT-5.5 maintains high-quality reasoning over larger codebases within its extended context window.
- Continued lead in terminal/code environments: 82.7% on Terminal-Bench 2.0 ranks among the best of all models currently available.
- Significant hallucination reduction: 60% fewer hallucinations means fewer code snippets that look correct but are actually wrong.
But benchmarks are just benchmarks—how does it perform in real engineering scenarios? Let’s dive into practice.
Case Study 1: Have GPT-5.5 propose a solution, let DeepSeek V4-Pro implement it
I wanted to optimize my [multi-agent stock analysis project](https://link.juejin.cn/?target=https%3A%2F%2Fmp.weixin.qq.com%2Fs%3F__biz%3DMzg2OTA0Njk0OA%3D%3D%26mid%3D2247553847%26idx%3D1%26sn%3D608624100c658af1df2a79338c0c46be%26scene%3D21%23wechat_redirect "https://mp.weixin.qq.com/s?__biz=Mzg2OTA0Njk0OA==&mid=2247553847&idx=1&sn=608624100c658af1df2a79338c0c46be&scene=21#wechat_redirect"), but I hadn’t opened it in nearly a month and had no ideas. So I simply asked:
Please refer to mature open-source AI stock analysis projects and suggest further optimizations for this project, for example, implementing stock alert notifications.

GPT-5.5 didn't rush into coding. It first reviewed several active, similar open-source projects and then provided a prioritized list based on the current project structure:
It divided suggestions into five priority levels—the highest being enhancing the alert system. The project already had basic alert logic (priceAlerts, technicalAlerts stored in ConcurrentHashMap), but they were volatile—lost on restart—and lacked corresponding Controller/API/UI components.
Since I also wanted to test DeepSeek V4-Pro, I had GPT-5.5 draft the full implementation plan, then handed it off to V4-Pro for execution.
This is exactly the approach I always advocate: let expensive models handle planning and decisions, cheaper ones do the heavy lifting.
However, if your token budget is unlimited, there’s no need for such cost-saving strategies.
V4-Pro implemented well—even though it didn’t get it right the first time, it fixed the issue promptly after receiving error feedback.
Here’s the new alert functionality:
We successfully received alerts via Feishu that same day:
Case Study 2: V4-Pro audits, GPT-5.5 reviews and fixes
Still working on the same [multi-agent stock analysis project](https://link.juejin.cn/?target=https%3A%2F%2Fmp.weixin.qq.com%2Fs%3F__biz%3DMzg2OTA0Njk0OA%3D%3D%26mid%3D2247553847%26idx%3D1%26sn%3D608624100c658af1df2a79338c0c46be%26scene%3D21%23wechat_redirect "https://mp.weixin.qq.com/s?__biz=Mzg2OTA0Njk0OA==&mid=2247553847&idx=1&sn=608624100c658af1df2a79338c0c46be&scene=21#wechat_redirect"). All features work, but due to tight deadlines, code quality wasn't thoroughly reviewed earlier.
My strategy this time: split auditing and fixing between two models—cheap one scans, expensive one repairs.
Specifically, I used Claude Code to run multiple DeepSeek V4-Pro agents in parallel, each focusing on one of security, correctness, or code quality, then aggregated the results into a single document.
After scanning, V4-Pro produced a prioritized issue list—the top five were:
- API Key stored in plaintext: Encryption already implemented but not integrated
- System management APIs lack permission control: Regular users can modify LLM settings
- Redis deserialization vulnerability: `activateDefaultTyping` allows arbitrary class instantiation
- Hardcoded third-party API Key: Real Bocha key committed in code
- Functional bug: "Re-analyze" button on History page fails due to unhandled route parameters
I reviewed each item—conclusions were mostly accurate. Security issues especially cannot wait—the Redis deserialization flaw could lead to severe consequences if exploited.
Next, I fed V4-Pro’s audit report directly into GPT-5.5 for review and repair.
Why not just let V4-Pro fix everything? Because identifying issues and fixing them require different skill sets. Auditing demands coverage—better to report too much than miss anything. Fixing requires precision—changes must be targeted without introducing new side effects. Each model excels at its own task—this division is far more reliable than having one model do everything.
This is something I’ve emphasized repeatedly in my article “[Reviewing Every AI Programming Model I’ve Used, Starting from the Ground Up](https://link.juejin.cn/?target=https%3A%2F%2Fmp.weixin.qq.com%2Fs%2F650VrMmcsCjv6iuDOPLRWg "https://mp.weixin.qq.com/s/650VrMmcsCjv6iuDOPLRWg")”—for truly challenging engineering problems, GPT-5.5 and Claude Opus 4.6 remain the most dependable flagship models.
GPT-5.5 reviewed and fixed each issue clearly, making the entire process smooth and efficient.
Back to this case: V4-Pro performed solidly—but what matters more is the cost savings: cheap model scans, expensive model fixes. Running a full project scan with V4-Pro during its promotional period costs almost nothing. Doing the same job with GPT-5.5 or Claude Opus 4.6 would increase costs by at least two orders of magnitude.
Case Study 3: Multi-model Configuration Refactor for Interview Platform
The [AI-powered interview assistant platform + RAG knowledge base (open-source)](https://link.juejin.cn/?target=https%3A%2F%2Fmp.weixin.qq.com%2Fs%2F0sSJNbTR4Od-P40N0uw6PQ "https://mp.weixin.qq.com/s/0sSJNbTR4Od-P40N0uw6PQ") uses the following tech stack:
- Backend: Spring Boot 4.0, Java 21, Spring AI 2.0, JPA, PostgreSQL, pgvector, Redis Stream
- Frontend: React 18, TypeScript, Vite, TailwindCSS 4
- AI Features: Resume analysis, mock interviews, voice interviews, knowledge base RAG, multi-provider model configuration
The project already includes a model configuration UI allowing users to select providers like DashScope, DeepSeek, GLM, Kimi, etc.
The screenshot above shows the optimized result. The initial version came from a teammate's PR, and its implementation had several clear issues:
- Configuration was mainly written in YAML or `.env`, not stored in a database.
- Default chat model and default vector model were tightly coupled.
- `LlmEmbeddingConfig` created a fixed `EmbeddingModel` Bean at startup; switching models at runtime did not actually affect the vectorization pipeline.
- Although the frontend had an "Embedding Model" input box, it didn't clearly distinguish between "chat model" and "vector model".
My first prompt to GPT-5.5 was roughly:
The current model configuration interface doesn't persist to a database or cache. If I restart the project, all settings are lost. Another issue is that DeepSeek and Kimi don’t support Embedding, but the project has RAG knowledge base functionality—this needs optimization.
It didn’t jump straight into code changes—it first read the project structure. This is crucial. In real projects, the worst thing is when the model builds its own architecture based on imagination.
GPT-5.5 first located several key files before proposing solutions, modifying code, and running tests.
Configuration Persistence
The original implementation was more like a “development environment temporary solution”: change the config, write it back to YAML or .env. It seemed usable locally, but problems became obvious once deployed in production:
- Classpath configs inside JARs are not writable.
- With multiple instances, each writes independently.
- `ReentrantReadWriteLock` only coordinates a single JVM, not a cluster.
- Config changes don't naturally align with Spring Boot's configuration binding lifecycle.
GPT-5.5 proposed this direction:
- Store Provider configurations in PostgreSQL.
- Use Redis only as cache—not the sole source of persistence.
- Keep YAML as seed configuration for startup.
- At runtime, use DB as the authoritative source for all configurations.
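The "YAML as seed, DB as authority" rule in the list above can be sketched as follows. This is a minimal illustration, not the project's code: the `ProviderConfig` record, the `seedIfAbsent` helper, and the in-memory map standing in for the `llm_provider_config` table are all hypothetical names chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: YAML seeds the store once; afterwards the stored copy always wins. */
final class ProviderConfigBootstrap {

    /** Hypothetical, simplified provider row (a real table would have more columns). */
    record ProviderConfig(String name, String baseUrl, String chatModel) {}

    /**
     * Copies YAML seed entries into the DB-backed store only for providers
     * that are not stored yet. Existing rows are never overwritten, so
     * runtime edits survive a restart. Returns the number of rows inserted.
     */
    static int seedIfAbsent(Map<String, ProviderConfig> dbStore,
                            Map<String, ProviderConfig> yamlSeed) {
        int inserted = 0;
        for (Map.Entry<String, ProviderConfig> entry : yamlSeed.entrySet()) {
            if (dbStore.putIfAbsent(entry.getKey(), entry.getValue()) == null) {
                inserted++;
            }
        }
        return inserted;
    }

    public static void main(String[] args) {
        // In-memory map standing in for the provider table (assumption for the demo).
        Map<String, ProviderConfig> db = new LinkedHashMap<>();
        db.put("glm", new ProviderConfig("glm", "https://example.invalid/glm", "edited-at-runtime"));

        Map<String, ProviderConfig> yaml = new LinkedHashMap<>();
        yaml.put("glm", new ProviderConfig("glm", "https://example.invalid/glm", "yaml-default"));
        yaml.put("deepseek", new ProviderConfig("deepseek", "https://example.invalid/deepseek", "yaml-default"));

        int inserted = seedIfAbsent(db, yaml);
        // The glm row edited at runtime is kept; only deepseek is seeded.
        System.out.println(inserted + " seeded; glm chat model = " + db.get("glm").chatModel());
    }
}
```

The design choice worth noting is that seeding is insert-only: a restart can add providers that appear in YAML for the first time, but it can never roll back a setting that someone changed through the UI.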
Eventually, two tables were implemented:
```
llm_provider_config
llm_global_setting
```

Midway through, I specifically discussed whether API keys should be stored in plaintext in the database.

GLM suggested storing them in plaintext in the first version, reasoning that .env also stores them in plaintext. I disagreed: plaintext in .env and plaintext in the database are entirely different risk levels.
The database can be backed up, queried via SQL, accessed by read-only accounts. Once API keys are stored plaintext, the exposure surface expands significantly.
So this time, I didn’t take shortcuts—the final implementation used AES-256-GCM encryption at the application layer:
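The project's own encryption code is not reproduced in this text. As a rough illustration of what application-layer AES-256-GCM looks like with the standard JDK crypto API, here is a minimal sketch. The class name `ApiKeyCipher`, the "IV prepended to ciphertext, then Base64" storage format, and the demo key are assumptions made for the example; in a real deployment the key would come from a secret manager or environment variable rather than being generated in place.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;

/** Sketch of AES-256-GCM encryption for a stored API key (hypothetical helper). */
final class ApiKeyCipher {
    private static final int IV_BYTES = 12;   // 96-bit nonce, the usual size for GCM
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Encrypts with a fresh random IV and returns Base64(IV || ciphertext+tag). */
    static String encrypt(SecretKey key, String plaintext) {
        try {
            byte[] iv = new byte[IV_BYTES];
            RANDOM.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
            byte[] out = ByteBuffer.allocate(iv.length + ciphertext.length)
                    .put(iv).put(ciphertext).array();
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("API key encryption failed", e);
        }
    }

    /** Reads the IV back from the first 12 bytes and verifies the GCM tag while decrypting. */
    static String decrypt(SecretKey key, String stored) {
        try {
            byte[] in = Base64.getDecoder().decode(stored);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, in, 0, IV_BYTES));
            byte[] plaintext = cipher.doFinal(in, IV_BYTES, in.length - IV_BYTES);
            return new String(plaintext, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("API key decryption failed", e);
        }
    }

    /** Generates a random 256-bit AES key (demo only). */
    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(256);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES key generation failed", e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = newKey();
        String stored = encrypt(key, "sk-demo-not-a-real-key");
        System.out.println(decrypt(key, stored).equals("sk-demo-not-a-real-key"));
    }
}
```

Because the IV is random per call, encrypting the same key twice yields different stored values, and any tampering with the stored value makes decryption fail instead of returning garbage.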
I’m quite satisfied with this part. It’s not just about CRUD—it continues pushing forward under security constraints.
| Item | Before | After |
| --- | --- | --- |
| Configuration Source | YAML / .env | PostgreSQL |
| Startup Configuration | Directly bound to config file | YAML as DB seed |
| API Key Storage | Plaintext in environment variables | Encrypted storage in DB |
| Default Models | One default provider | Two defaults: Chat & Embedding |
| Runtime Refresh | Relies on single-JVM lock | Registry cache invalidation triggers rebuild |
Chat Provider and Embedding Provider Must Be Separated
The second issue is more critical: DeepSeek and Kimi can serve as chat models, but they cannot perform embeddings.
If the project only involves regular conversations, this isn’t a problem. But RAG requires document vectorization.
Mainstream Chinese vendors’ embedding support looks like this:
| Vendor | Supports Embedding | Common Models |
| --- | --- | --- |
| Alibaba Tongyi | Yes | text-embedding-v3 |
| Zhipu GLM | Yes | embedding-3 |
| Baidu Wenxin | Yes | Embedding-V1 |
| MiniMax | Yes | embo-01 |
| DeepSeek | No | - |
| Kimi / Moonshot | No | - |
The old design was:
```
default-provider -> ChatClient
default-provider -> EmbeddingModel
```

This is dangerous.
For example, if you switch the default model to DeepSeek:
- Resume analysis and interview question generation can go through DeepSeek.
- But the knowledge base vectorization will have no available embedding model.
More subtly, LlmEmbeddingConfig creates a fixed Bean at startup. If you switch the default provider at runtime, the ChatClient may change—but the EmbeddingModel might not follow suit.
GPT-5.5’s fix:
- Separate default chat provider from default embedding provider.
- `LlmProviderRegistry` manages both `ChatClient` and `EmbeddingModel` caches.
- `LlmEmbeddingConfig` no longer creates a fixed model; it creates a delegate-style `EmbeddingModel`.
- Each time vectorization occurs, it fetches the current default embedding provider from the registry.
The core idea looks like this:
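```java
import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.embedding.EmbeddingRequest;
import org.springframework.ai.embedding.EmbeddingResponse;
import org.springframework.context.annotation.Bean;

// Bean method declared in LlmEmbeddingConfig.
@Bean
public EmbeddingModel embeddingModel(LlmProviderRegistry registry) {
    // Delegate: look up the current default embedding model on every call
    // instead of capturing one instance at startup.
    return new EmbeddingModel() {
        @Override
        public EmbeddingResponse call(EmbeddingRequest request) {
            return registry.getDefaultEmbeddingModel().call(request);
        }

        @Override
        public float[] embed(Document document) {
            return registry.getDefaultEmbeddingModel().embed(document);
        }
    };
}
```

This change is small but meaningful.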
It bypasses Spring Bean lifecycle issues: VectorStore still injects an EmbeddingModel Bean, but behind the scenes, it dynamically delegates to the current default embedding model.
This is truly suitable for “runtime model configuration switching.”
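The registry side of this design is not shown in the article. A minimal sketch of the idea, assuming a lazily built cache keyed by provider id, might look like the following. `ProviderRegistrySketch`, its method names, and the use of plain strings as stand-in clients are all hypothetical; the real `LlmProviderRegistry` works with Spring AI types and may be structured differently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Sketch of a registry that builds clients lazily and rebuilds them after a config change. */
final class ProviderRegistrySketch<C> {
    private final Map<String, C> cache = new ConcurrentHashMap<>();
    private final Function<String, C> factory;         // builds a client for a provider id
    private volatile String defaultEmbeddingProvider;  // re-read on every lookup

    ProviderRegistrySketch(Function<String, C> factory, String defaultEmbeddingProvider) {
        this.factory = factory;
        this.defaultEmbeddingProvider = defaultEmbeddingProvider;
    }

    /** Returns the cached client for the current default provider, building it on first use. */
    C getDefaultEmbeddingModel() {
        return cache.computeIfAbsent(defaultEmbeddingProvider, factory);
    }

    /** Changes which provider is the default; takes effect on the next lookup. */
    void switchDefaultEmbeddingProvider(String providerId) {
        this.defaultEmbeddingProvider = providerId;
    }

    /** Drops a cached client so the next lookup rebuilds it from fresh configuration. */
    void invalidate(String providerId) {
        cache.remove(providerId);
    }

    public static void main(String[] args) {
        // Strings stand in for real embedding clients in this demo.
        ProviderRegistrySketch<String> registry =
                new ProviderRegistrySketch<>(id -> "client-for-" + id, "glm");
        System.out.println(registry.getDefaultEmbeddingModel());
        registry.switchDefaultEmbeddingProvider("dashscope");
        System.out.println(registry.getDefaultEmbeddingModel());
    }
}
```

Because the delegate bean asks the registry on every call, switching the default provider or invalidating a cached entry takes effect without recreating any Spring Bean.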
GLM embedding-3 Dimension Pitfall
Then we hit a more realistic pitfall.
I set GLM as the default vector service and filled in the model name as embedding-3, but async vectorization failed:
```
ERROR: expected 1024 dimensions, not 2048
```

This wasn't a wrong model name.

The problem was:
- Our pgvector table expects 1024 dimensions.
- GLM's `embedding-3` returns 2048 dimensions by default.
- If Spring AI doesn't explicitly specify `dimensions`, it uses the server's default dimension.
This kind of issue is very subtle—model name is correct, provider is correct, API key works—but it fails during database write.
GPT-5.5 fixed it by including “vector dimension” in the provider configuration.
Backend added:

```
embedding_dimensions
```

In the config, we added for GLM and DashScope:

```yaml
embedding-dimensions: 1024
```

When creating OpenAiEmbeddingOptions, explicitly pass:

```java
OpenAiEmbeddingOptions options = OpenAiEmbeddingOptions.builder()
        .model(config.embeddingModel())
        .dimensions(resolveEmbeddingDimensions(config.embeddingDimensions()))
        .build();
```

Frontend also added an "embedding dimension" input field, defaulting to 1024.
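The body of `resolveEmbeddingDimensions` is not shown in the article. One plausible shape, assuming it simply falls back to the table width when the field is unset, is sketched below together with a guard that fails fast before a mismatched vector reaches pgvector. The class name, the `requireTableWidth` helper, and the fallback behavior are assumptions for illustration only.

```java
/** Sketch of dimension handling for a vector table fixed at 1024 dimensions. */
final class EmbeddingDimensions {
    static final int TABLE_DIMENSIONS = 1024;  // width of the existing pgvector column

    /** Hypothetical helper: use the configured value, or the table width when unset. */
    static int resolveEmbeddingDimensions(Integer configured) {
        if (configured == null) {
            return TABLE_DIMENSIONS;
        }
        if (configured <= 0) {
            throw new IllegalArgumentException("embedding dimensions must be positive: " + configured);
        }
        return configured;
    }

    /** Fails before the database write if the provider returned an unexpected width. */
    static float[] requireTableWidth(float[] vector) {
        if (vector.length != TABLE_DIMENSIONS) {
            throw new IllegalStateException(
                    "expected " + TABLE_DIMENSIONS + " dimensions, not " + vector.length);
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(resolveEmbeddingDimensions(null));
        try {
            requireTableWidth(new float[2048]);  // same width as the failing GLM response
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Checking the width in application code gives an error that names the provider-side cause, instead of surfacing later as a database write failure in an async job.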
I think this point deserves special mention.
In RAG systems, embedding models aren’t just about “whether the model name works”—you must also consider:
- Vector dimensions
- Distance function
- pgvector table structure
- Whether existing data needs re-vectorization
- Whether different knowledge bases can mix dimensions
This time, we didn’t immediately support multi-dimensional coexistence—we locked the system to 1024 dimensions. This decision makes sense because the current vector_store.embedding table was already fixed at 1024 dimensions.
First version ensures usability and stability. Later, if multi-dimension support is needed, we’ll design separate tables or isolate per-knowledge-base dimensions.
How Did GPT-5.5 Perform?
Overall, I’m very satisfied with GPT-5.5’s performance this time.
Its most obvious strength: it doesn’t just fix one line of error—it traces upward along system boundaries.
For example, the initial “configuration loss after restart” issue—it didn’t just suggest fixing YAML writing. Instead, it recognized:
- YAML isn’t a good source for runtime configuration.
- DB is better suited as the provider configuration center.
- Redis should only be used for caching.
- Registry must support runtime refresh.
- EmbeddingModel must be registered like ChatClient.
Another example: GLM embedding-3 dimension issue—initially it didn’t directly think of this in the plan. But upon seeing the real error, it quickly corrected the assumption from “model unusable” to “dimension mismatch,” and included dimensions in the configuration chain.
This is the most important skill in practice: quickly adjust assumptions based on logs.
However, GPT-5.5 isn’t perfect. Several points need human oversight this time:
- Provider latest model names shouldn't rely on memory: e.g., DeepSeek's model name must be verified via official docs or actual APIs. Guessing a seemingly new model name risks immediate 400 errors.
- RAG dimension issues require real execution to surface: paper plans often miss differences between pgvector dimensions and default embedding API dimensions.
- Tool-call compatibility shouldn't be assumed: "OpenAI-compatible" only means the interface shape is compatible, not that every tool-call detail is.
- Security rules need contextual debugging: `PromptSanitizer` alerts aren't bad, but you must distinguish user-triggered alerts from false positives caused by internal formats.
How to Best Combine GPT-5.5 + Codex?
GPT-5.5 is now available in Codex cloud agents (Plus users can access it, context window 400K). This section shares several tips distilled from real-world practice—see my detailed guide [“Best Practices for OpenAI Codex”](https://link.juejin.cn/?target=https%3A%2F%2Fjavaguide.cn%2Fai%2Fai-coding%2Fcodex-best-practices.html "https://javaguide.cn/ai/ai-coding/codex-best-practices.html") for more.
Action-Oriented: Let the Model Just Work
Codex prompt design follows a core principle: Action Bias. Good prompts should guide the model to deliver working code—not end replies with questions.
Specifically:
- Clearly tell the model: “deliver working code, not just plans.”
- The model should make reasonable assumptions and proceed.
- Only ask users when truly blocked (missing key info or contradictory constraints).
Bad Example: Prompt asks the model to “list a plan first, then execute after confirmation.” This causes the model to stop before finishing work—severely reducing efficiency.
Good Example: Prompt says: “Start working immediately after receiving the task, reasonably assume unclear parts, show results afterward. If blocked and unable to decide, then ask the user.”
Context Gathering: Plan First, Then Parallelize
Before modifying code, Codex should fully understand the codebase. Prompts should explicitly require:
- Batch Reading: Before calling tools, determine which files are needed, then read them in parallel.
- Avoid Serial Exploration: Don’t read one file at a time.
- Search Before Adding: Before adding new implementations, search if similar features exist.
This "plan first, then parallelize" strategy significantly reduces round trips. In Case Study 3, GPT-5.5 first read the project structure, located key files, and only then made suggestions, which is this strategy showing up naturally.
AGENTS.md: Inject Project Context into Codex
AGENTS.md serves a similar purpose to Claude Code's CLAUDE.md, both injecting project-level context and guidelines for AI. Codex automatically scans and injects the AGENTS.md file, with loading logic following a layered override principle:
| Level | Path | Scope |
| --- | --- | --- |
| Global | ~/.codex/AGENTS.md | Universal defaults for all projects |
| Project | Repository root AGENTS.md | Project-level conventions |
| Module | Subdirectory AGENTS.md | Module-specific rules |
It's recommended to place an AGENTS.md at the project root, covering at minimum: build commands, testing standards, code style conventions, and Git workflow practices.
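To make the layering in the table concrete, the sketch below computes the order in which `AGENTS.md` candidates would be consulted for a given working directory, from most general to most specific. It only models the lookup order described in the table; it is not Codex's implementation, and `AgentsFileLookup` and its parameters are hypothetical names.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Sketch: candidate AGENTS.md paths, ordered from global to most specific module. */
final class AgentsFileLookup {

    static List<Path> candidates(Path home, Path repoRoot, Path workingDir) {
        if (!workingDir.startsWith(repoRoot)) {
            throw new IllegalArgumentException("working directory must be inside the repository");
        }
        List<Path> result = new ArrayList<>();
        result.add(home.resolve(".codex").resolve("AGENTS.md"));  // global defaults
        result.add(repoRoot.resolve("AGENTS.md"));                // project conventions

        // Walk down from the repository root to the working directory,
        // adding one module-level candidate per directory.
        Path current = repoRoot;
        for (Path segment : repoRoot.relativize(workingDir)) {
            if (segment.toString().isEmpty()) {
                continue;  // workingDir is the repository root itself
            }
            current = current.resolve(segment);
            result.add(current.resolve("AGENTS.md"));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Path> order = candidates(
                Path.of("/home/dev"), Path.of("/repo"), Path.of("/repo/backend/llm"));
        order.forEach(System.out::println);
    }
}
```

Later entries in the list are closer to the code being edited, which is why a module-level file is the natural place for rules that should override project-wide defaults.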
Choosing the Right Safety Mode
Codex offers three safety modes:
| Mode | Description | Use Case |
| --- | --- | --- |
| Suggest | Can read files, but all write operations and commands require confirmation | Code review, learning |
| Auto Edit | Automatically edits files, but command-line actions require confirmation | Daily development |
| Full Auto | Fully automatic: both editing and commands are executed without confirmation | CI/CD, batch tasks |
The Guide recommends: Start with Suggest mode to build trust, then switch to Auto Edit for efficiency, and only consider Full Auto after that. Jumping straight into Full Auto risks failure without understanding what went wrong.
High-cost models for planning, low-cost models for execution
Case Studies 1 and 2 both follow this approach: in the first, GPT-5.5 proposes the solution and V4-Pro implements it; in the second, V4-Pro scans for issues and GPT-5.5 reviews and fixes them. In Codex, you can do the same: use GPT-5.5 for architectural decisions and complex troubleshooting, and delegate routine coding and code scanning to cheaper models.
This strategy yields clear cost advantages. Based on Codex’s pay-per-use pricing, the same project-level code scan task costs dozens of times more when using GPT-5.5 compared to V4-Pro.
Summary
At this point, the Guide wants to share key takeaways from this GPT-5.5 practical experience.
First, it truly handles medium-to-large-scale refactoring. But the condition is clear: you must provide real logs, real code, and real error messages. If you only give a vague prompt like "optimize the model configuration," it will likely return generic suggestions. However, if you feed it concrete issues such as "configuration lost after restart," "GLM embedding-3 dimension error on write," or "DeepSeek voice path 400 error," it can trace through the engineering chain layer by layer until it finds the root cause.
Second, all three practical examples share a common methodology: leverage different models where they excel most. GPT-5.5 for designing solutions, V4-Pro for implementation; V4-Pro for detecting issues, GPT-5.5 for defining fixes and rewriting code. This isn’t a novel idea, but GPT-5.5 makes it easier to implement—it generates high-quality plans that cheap models execute reliably with minimal errors.
The Guide’s stance is clear: GPT-5.5 and Claude Opus 4.6/4.7 each have their own strengths. I use them together daily—but if you must choose one or your team has limited budget, GPT-5.5 + ChatGPT Plus is currently the best value-for-money option—no contest.