Claude Opus 4.8 is here. Is it as good as they say?

TL;DR · AI Summary
Opus 4.8 scores 69.2% on SWE-Bench Pro, about 5 points above Opus 4.7 and about 10 above GPT-5.5, but real-world coding reveals persistent ‘last 10%’ failures and hallucinations; pricing is steep at $5 per 1M input tokens.
Key Takeaways
- Scores 69.2% on SWE-Bench Pro, +5 pts vs Opus 4.7, +10 vs GPT-5.5, +15 vs Gemi
- Pricing: $5 per 1M input tokens, $25 per 1M output tokens, significantly higher t
- Excels at greenfield prototyping and one-shot features, but struggles with edge
Outline
Anthropic releases Opus 4.8 as a step-change model for agent workflows, emphasizing honesty, long-horizon autonomy, and enterprise readiness.
Achieves 69.2% on SWE-Bench Pro, nearly 5 points above Opus 4.7, and costs $5 per 1M input tokens and $25 per 1M output tokens.
Shines in greenfield prototyping and single-feature execution but consistently fails on the ‘last 10%’, edge cases, and hallucination control in legacy code.
Does not outperform Opus 4.7 on data-heavy strategy or roadmap tasks; author still prefers 4.7 for reliability and lower hallucination rates.
Includes dynamic workflows with parallel subagents and effort control; optimal use requires structured prompting and staged validation.
Mindmap
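- Claude Opus 4.8 hands-on evaluation
  - Technical metrics
    - SWE-Bench Pro: 69.2%
    - Pricing: $5 per 1M input tokens, $25 per 1M output tokens
    - Target scenarios: agents, long-running tasks
  - Hands-on performance
    - Strengths: greenfield prototypes, one-shot features, execution speed
    - Weaknesses: the last 10% problem, edge cases, hallucinations
    - Adapts to existing codebases worse than Opus 4.7
  - Companion capabilities
    - Dynamic workflows (parallel subagents)
    - Effort control in Claude.ai and Cowork
    - Recommended prompting strategy: staged validation plus structured instructions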
Highlights
SWE-Bench Pro score of 69.2% (nearly 5 points above Opus 4.7, 10 above GPT-5.5, and 15 above Gemini 3.1) is among the highest publicly reported.
Pricing at $5 per 1M input tokens and $25 per 1M output tokens makes it costly for high-volume usage, especially compared to alternatives.
In live coding tests, Opus 4.8 rapidly builds prototypes but repeatedly fails on error handling, boundary conditions, and documentation consistency—the ‘last 10%’ problem.
The author explicitly recommends Opus 4.7 over 4.8 for data-intensive strategy and roadmap work due to its superior stability and lower hallucination rate.
Transcript
0:04
Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive here on a mission to help you build better with these new tools. Today we have a very special mini episode because Anthropic just dropped Opus 4.8, their latest state-of-the-art coding model.
0:20
And I got a few hours of early access and I'm here to share my very early thoughts about where this model is intended to perform well, where it did a great job and totally impressed me, and where there's still a little bit further to go. Let's get to it. As you can tell,
0:34
I am not in my regular How I AI studio, and that's because I am so excited to give you my early thoughts on Opus 4.8 and couldn't wait between meetings to share what I thought. So to get started, I want to talk about what this model is, what Anthropic has told us about its benchmarks, performance,
0:50
and what it's good at. So Anthropic is shipping Opus 4.8. It is supposed to be their step-change model for agents, and there's a couple things they've called out that this model does particularly well. It's supposed to be more honest, less design slop, longer-horizon autonomy on long-running tasks, and enterprise ready, so it means it follows
1:12
its instructions. And they're saying that on SWE-Bench Pro they're hitting 69.2 percent, which is almost five points higher than Opus 4.7, almost 10 points higher than GPT-5.5, and 15 points higher than Gemini 3.1. Now, this model is not cheap. It's $5 per million input tokens and $25 per million output tokens, and
🎙️My first impressions of Opus 4.8—where it excels and where it falls short
May 28, 2026
I got a few hours of early-access testing with Anthropic’s newly released model Opus 4.8. I walk through real coding, design, and strategy tasks across Claude Code and Claude Cowork, and give you my unfiltered view on what impressed me and what didn’t.
Listen or watch on [YouTube](https://youtu.be/h0gZf1hL4D4), [Spotify](https://open.spotify.com/show/4aRP2XSavdtrLG5FZoonOK), or [Apple Podcasts](https://podcasts.apple.com/us/podcast/how-i-ai/id1809663079)
What you’ll learn:
- Where Opus 4.8 excels: greenfield prototypes, one-shot features, and fast execution
- Where it struggles: the last 10%, edge cases in existing codebases, and hallucinations
- How Opus 4.8 compares to Opus 4.7 on business strategy work
- Why I’m still reaching for Opus 4.7 on data-heavy strategy and roadmap work
- The new features shipping alongside the model: dynamic workflows with parallel subagents and effort control in Claude.ai and Cowork
- The prompting and harness strategy I’d use to get the most out of it
* * *
In this episode, we cover:
(00:00) Introduction to Opus 4.8
(00:44) Benchmark performance and pricing
(01:53) First coding test: Building a prototyping tool
(03:00) Where it failed: The last 10% problem
(03:27) The hallucination problem
(04:23) Testing Opus 4.8 on existing codebases
(05:24) The ambition test: Building games for a 9-year-old
(07:03) Business strategy test: 4.7 vs 4.8
(08:23) The roadmap test
(09:17) Final verdict
References:
• System Card: Claude Opus 4.8: https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0925c805a1a7ca77314ccbf4a6.pdf
• Introducing Claude Opus 4.8 on X:
Where to find Claire Vo:
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email jordan@penname.co.

How I AI
How I AI, hosted by Claire Vo, is for anyone wondering how to actually use these magical new tools to improve the quality and efficiency of their work. In each episode, guests will share a specific, practical, and impactful way they’ve learned to use AI in their work or life. Expect 30-minute episodes, live screen sharing, and tips/tricks/workflows you can copy immediately. If you want to demystify AI and learn the skills you need to thrive in this new world, this podcast is for you.