Lenny's Newsletter

Claude Opus 4.8 is here. Is it as good as they say?

Score: 8.7
TL;DR · AI Summary

Opus 4.8 scores 69.2% on SWE-bench Pro (about 5 points above Opus 4.7 and about 10 above GPT-5.5), but real-world coding reveals persistent ‘last 10%’ failures and hallucinations. Pricing is steep at $5 per million input tokens and $25 per million output tokens.

Key Takeaways

  • Scores 69.2% on SWE-bench Pro, +5 pts vs Opus 4.7, +10 vs GPT-5.5, +15 vs Gemi
  • Pricing: $5 per 1M input tokens, $25 per 1M output tokens—significantly higher t
  • Excels at greenfield prototyping and one-shot features, but struggles with edge

Outline

Jump quickly between sections.

  1. Opus 4.8 Launch Context & Positioning

    Anthropic releases Opus 4.8 as a step-change model for agent workflows, emphasizing honesty, long-horizon autonomy, and enterprise readiness.

  2. Achieves 69.2% on SWE-bench Pro, nearly 5 points above Opus 4.7, and costs $5 per million input tokens and $25 per million output tokens.

  3. Shines in greenfield prototyping and single-feature execution but consistently fails on the ‘last 10%’, edge cases, and hallucination control in legacy code.

  4. Does not outperform Opus 4.7 on data-heavy strategy or roadmap tasks; author still prefers 4.7 for reliability and lower hallucination rates.

  5. Includes dynamic workflows with parallel subagents and effort control; optimal use requires structured prompting and staged validation.
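The pricing quoted in the outline can be turned into a rough per-request cost estimate. The sketch below is illustrative only: it assumes both prices are quoted per million tokens (the transcript states the unit only for output tokens), and the function name and the token counts in the example are hypothetical.

```python
# Rough cost estimate for a single request.
# Assumption: both prices are per million tokens; the episode transcript
# states the unit explicitly only for output tokens.
INPUT_USD_PER_MILLION = 5.0
OUTPUT_USD_PER_MILLION = 25.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for the given token counts."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical agent run: 200k tokens read, 20k tokens written.
    print(f"${estimate_cost_usd(200_000, 20_000):.2f}")  # prints $1.50
```

Under these assumptions, output tokens cost five times as much as input tokens, so long generated outputs drive the bill faster than large prompts do.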

Mindmap

See how the topics connect at a glance.

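  • Claude Opus 4.8 hands-on evaluation
    • Technical metrics
      • SWE-bench Pro: 69.2%
      • Pricing: $5/M input, $25/M output
      • Target scenarios: agents, long-horizon tasks
    • Hands-on performance
      • Strengths: greenfield prototypes, one-shot features, execution speed
      • Weaknesses: the last 10% problem, edge cases, hallucinations
      • Adapts to existing code less well than Opus 4.7
    • Companion capabilities
      • Dynamic workflows (parallel subagents)
      • Effort control in Claude.ai/Cowork
      • Recommended prompting strategy: staged validation + structured instructions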

Highlights

Key sentences worth saving and sharing.

  • SWE-bench Pro score of 69.2%, nearly 5 points above Opus 4.7, 10 above GPT-5.5, and 15 above Gemini 3.1, is among the highest publicly reported.

    00:44

  • Pricing at $5 per million input tokens and $25 per million output tokens makes it costly for high-volume usage, especially compared to alternatives.

    00:44

  • In live coding tests, Opus 4.8 rapidly builds prototypes but repeatedly fails on error handling, boundary conditions, and documentation consistency—the ‘last 10%’ problem.

    03:00–03:27

  • The author explicitly recommends Opus 4.7 over 4.8 for data-intensive strategy and roadmap work due to its superior stability and lower hallucination rate.

    08:23

#Claude #LLM #Anthropic #AI coding #benchmark
Transcript

0:04

Claire Vo

Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive here on a mission to help you build better with these new tools. Today we have a very special mini episode because Anthropic just dropped Opus 4.8, their latest state-of-the-art coding model.

0:20

And I got a few hours of early access and I'm here to share my very early thoughts about where this model is intended to perform well, where it did a great job and totally impressed me, and where there's still a little bit further to go. Let's get to it. As you can tell,

0:34

I am not in my regular How I AI studio, and that's because I am so excited to give you my early thoughts on Opus 4.8 and couldn't wait between meetings to share what I thought. So to get started, I want to talk about what this model is, what Anthropic has told us about its benchmarks, performance,

0:50

and what it's good at. So Anthropic is shipping Opus 4.8. It is supposed to be their step-change model for agents, and there's a couple things they've called out that this model does particularly well: it's supposed to be more honest, a less designed flop [sic], longer-horizon autonomy on long-running tasks, and enterprise ready, so it means it follows

1:12

its instructions. And they're saying that on SWE-bench Pro they're hitting 69.2 percent, which is almost five points higher than Opus 4.7, almost 10 points higher than GPT-5.5, and 15 points higher than Gemini 3.1. Now, this model is not cheap: it's $5 per [million] input tokens and $25 per million output tokens, and

🎙️My first impressions of Opus 4.8—where it excels and where it falls short

Claire Vo

May 28, 2026

I got a few hours of early-access testing with Anthropic’s newly released model Opus 4.8. I walk through real coding, design, and strategy tasks across Claude Code and Claude Cowork, and give you my unfiltered view on what impressed me and what didn’t.

Listen or watch on [YouTube](https://youtu.be/h0gZf1hL4D4), [Spotify](https://open.spotify.com/show/4aRP2XSavdtrLG5FZoonOK), or [Apple Podcasts](https://podcasts.apple.com/us/podcast/how-i-ai/id1809663079)

What you’ll learn:

  1. Where Opus 4.8 excels: greenfield prototypes, one-shot features, and fast execution
  2. Where it struggles: the last 10%, edge cases in existing codebases, and hallucinations
  3. How Opus 4.8 compares to Opus 4.7 on business strategy work
  4. Why I’m still reaching for Opus 4.7 on data-heavy strategy and roadmap work
  5. The new features shipping alongside the model: dynamic workflows with parallel subagents and effort control in Claude.ai and Cowork
  6. The prompting and harness strategy I’d use to get the most out of it

In this episode, we cover:

(00:00) Introduction to Opus 4.8

(00:44) Benchmark performance and pricing

(01:53) First coding test: Building a prototyping tool

(03:00) Where it failed: The last 10% problem

(03:27) The hallucination problem

(04:23) Testing Opus 4.8 on existing codebases

(05:24) The ambition test: Building games for a 9-year-old

(07:03) Business strategy test: 4.7 vs 4.8

(08:23) The roadmap test

(09:17) Final verdict

References:

• System Card: Claude Opus 4.8: https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0925c805a1a7ca77314ccbf4a6.pdf

• Introducing Claude Opus 4.8 on X:

Where to find Claire Vo:

ChatPRD: https://www.chatprd.ai/

Website: https://clairevo.com/

LinkedIn: https://www.linkedin.com/in/clairevo/

X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email jordan@penname.co.

How I AI

How I AI, hosted by Claire Vo, is for anyone wondering how to actually use these magical new tools to improve the quality and efficiency of their work. In each episode, guests will share a specific, practical, and impactful way they’ve learned to use AI in their work or life. Expect 30-minute episodes, live screen sharing, and tips/tricks/workflows you can copy immediately. If you want to demystify AI and learn the skills you need to thrive in this new world, this podcast is for you.
