Claude Opus 4.8 Has Been Released: Is It Really as Powerful as Advertised?

Lenny's Newsletter, May 28, 2026

Claude Opus 4.8 is here. Is it as good as they say?

Score: 8.7

TL;DR · AI Summary

Opus 4.8 scores 69.2% on SWE-Bench Pro, about 5 points above Opus 4.7 and about 10 above GPT-4.5, but real-world coding reveals persistent ‘last 10%’ failures and hallucinations; pricing is steep at $5 per 1M input tokens.

Key Takeaways

Scores 69.2% on SWE-Bench Pro: +5 pts vs Opus 4.7, +10 vs GPT-4.5, +15 vs Gemini 3.1
Pricing: $5 per 1M input tokens, $25 per 1M output tokens, significantly higher than alternatives
Excels at greenfield prototyping and one-shot features, but struggles with edge cases

Outline

§Opus 4.8 Launch Context & Positioning
Anthropic releases Opus 4.8 as a step-change model for agent workflows, emphasizing honesty, long-horizon autonomy, and enterprise readiness.
·Benchmark Performance & Pricing
Achieves 69.2% on SWE-Bench Pro, nearly 5 points above Opus 4.7, and costs $5 per 1M input tokens and $25 per 1M output tokens.
·Real-World Testing: Strengths and Weaknesses
Shines in greenfield prototyping and single-feature execution but consistently fails on the ‘last 10%’, edge cases, and hallucination control in legacy code.
·Business Strategy Comparison & Usage Guidance
Does not outperform Opus 4.7 on data-heavy strategy or roadmap tasks; author still prefers 4.7 for reliability and lower hallucination rates.
·New Features & Prompting Strategies
Includes dynamic workflows with parallel subagents and effort control; optimal use requires structured prompting and staged validation.

Mindmap

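Claude Opus 4.8 hands-on evaluation
- Technical metrics
  - SWE-Bench Pro: 69.2%
  - Pricing: $5 per 1M input tokens, $25 per 1M output tokens
  - Target scenarios: agents, long-horizon tasks
- Real-world performance
  - Strengths: greenfield prototyping, one-shot features, execution speed
  - Weaknesses: the "last 10%" problem, edge cases, hallucinations
  - Adapts to existing codebases worse than Opus 4.7
- Supporting features
  - Dynamic workflows (parallel subagents)
  - Effort control in Claude.ai/Cowork
  - Recommended prompting strategy: staged validation plus structured instructions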

Highlights

A SWE-Bench Pro score of 69.2% (nearly 5 points above Opus 4.7, 10 above GPT-4.5, and 15 above Gemini 3.1) is among the highest publicly reported.
— 00:44
Pricing at $5 per 1M input tokens and $25 per 1M output tokens makes it costly for high-volume usage, especially compared to alternatives.
— 00:44
In live coding tests, Opus 4.8 rapidly builds prototypes but repeatedly fails on error handling, boundary conditions, and documentation consistency—the ‘last 10%’ problem.
— 03:00–03:27
The author explicitly recommends Opus 4.7 over 4.8 for data-intensive strategy and roadmap work due to its superior stability and lower hallucination rate.
— 08:23

#Claude #LLM #Anthropic #AI coding #benchmark
