AI Dev 26 x SF | Ara Khan: Evals Are Broken, Use Them Anyway

DeepLearning.AI · Video · May 22, 2026

Score: 7.8

TL;DR · AI Summary

AI evals are fundamentally broken: treating benchmark scores as objective truth is misleading. They remain critical, though, when they are built, interpreted, and embedded properly in agent workflows.

Key Takeaways

Current mainstream evals (e.g., Epoch AI, OpenAI benchmarks) suffer from 'false precision'.
Meta's new model scored benchmark-max yet disappointed users—showing evals are d
Proper eval use requires three steps: build (design), interpret (analyze), and embed (use in the agent flow).

Outline

§Introduction: Why Evals Matter
Evals are among the most critical aspects of AI agent development, regardless of agent type or complexity—everyone should learn to build, interpret, and use them.
§Two Common Misconceptions About Evals
People wrongly assume evals provide objective truth, but near-identical scores often mask significant real-world performance differences.
·Misconception 1: Blind Trust in Benchmark Scores
For example, Sonnet 4.6's score of 52 looks precise next to similarly scored models, yet half an hour of real use shows that the scores reflect actual capability poorly.
·Misconception 2: Evals Become a 'Gaming' Exercise
Many labs optimize solely for eval scores, which pulls models away from user needs, as shown by Meta's benchmark-maxed model that disappointed users.
§How to Use Evals Correctly: A 3-Step Practice
Transform evals from static metrics into dynamic feedback by building, interpreting, and embedding them directly into agent flows for closed-loop improvement.
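
The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the speaker's implementation: `EvalCase`, `run_evals`, `interpret`, `answer_with_check`, and `toy_agent` are hypothetical names, and `toy_agent` is a deterministic stand-in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # True when the output is acceptable


def run_evals(agent: Callable[[str], str], cases: List[EvalCase]) -> List[bool]:
    """Build: run every case and record pass/fail per case."""
    return [case.check(agent(case.prompt)) for case in cases]


def interpret(cases: List[EvalCase], results: List[bool]) -> Tuple[float, List[str]]:
    """Interpret: keep the failing prompts, not just the aggregate score."""
    failures = [c.prompt for c, ok in zip(cases, results) if not ok]
    pass_rate = sum(results) / len(results) if results else 0.0
    return pass_rate, failures


def answer_with_check(agent, prompt, check, max_attempts=3):
    """Embed: run the check inside the flow and retry when it fails."""
    output = ""
    for attempt in range(1, max_attempts + 1):
        output = agent(prompt)
        if check(output):
            return output, attempt
        # Feed the failure back to the agent before the next attempt.
        prompt = prompt + "\nPrevious answer failed the check; try again."
    return output, max_attempts


def toy_agent(prompt: str) -> str:
    # Hypothetical stand-in for a model: wrong on arithmetic until told to retry.
    if "capital of France" in prompt:
        return "Paris"
    if "2 + 2" in prompt:
        return "4" if "try again" in prompt else "5"
    return "unknown"


cases = [
    EvalCase("What is the capital of France?", lambda out: "Paris" in out),
    EvalCase("What is 2 + 2?", lambda out: out.strip() == "4"),
]

results = run_evals(toy_agent, cases)
pass_rate, failures = interpret(cases, results)
output, attempts = answer_with_check(toy_agent, "What is 2 + 2?", cases[1].check)
```

In this toy run the aggregate pass rate alone hides which case failed; the failure list is what feeds the retry loop.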

Mindmap

AI Evals Are Broken — But Still Useful
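- Root causes
  - False precision: similar scores ≠ equal capability
  - Evals turned into a score-chasing game
- The right way to use them
  - Build evals (Build)
  - Interpret evals (Interpret)
  - Embed in the agent flow (Use)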
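
One narrow way to see the "false precision" point is sampling noise alone. The sketch below applies the standard normal approximation to a pass rate. The score of 52 comes from the outline; the benchmark size of 200 items and the comparison score of 50 are assumptions chosen only for illustration, and the function names are hypothetical. It does not capture the talk's broader point that scores can diverge from real-world usefulness.

```python
import math


def score_interval(score_pct: float, n_items: int, z: float = 1.96):
    """Approximate 95% interval for a pass rate (normal approximation)."""
    p = score_pct / 100.0
    se = math.sqrt(p * (1.0 - p) / n_items)  # binomial standard error
    return (100 * (p - z * se), 100 * (p + z * se))


def intervals_overlap(a, b) -> bool:
    """True when two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Assumed values: a 200-item benchmark and a second model scoring 50.
low, high = score_interval(52, 200)
other = score_interval(50, 200)
overlap = intervals_overlap((low, high), other)

print(f"52 on 200 items: roughly {low:.1f} to {high:.1f}")
print(f"overlaps with a score of 50: {overlap}")
```

Under these assumed numbers, the interval around 52 spans roughly 45 to 59, so a two-point gap between models would sit well inside the noise.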

Highlights

If you spent like half an hour using any of these models, you'll know very quickly that these scores don't necessarily mean much.
— 3:17
Meta came out with a new model... it was a huge disappointment because it was benchmark max.
— 3:29
Most people know a lot of things about evals... but they're wrong about evals.
— 0:55

#AI Evaluation #Agent Systems #Benchmarking #LLM #Engineering Practice