AI Dev 26 x SF | Ara Khan: Evals Are Broken — Use Them Anyway
TL;DR · AI Summary
AI evals are fundamentally broken: treating their scores as objective truth is misleading. They remain critical, however, when they are built, interpreted, and embedded properly in agent workflows.
Key Takeaways
- Current mainstream evals (e.g., Epoch AI, OpenAI benchmarks) suffer from 'false
- Meta's new model scored benchmark-max yet disappointed users—showing evals are d
- Proper eval use requires three steps: build (design), interpret (analyze), and e
Outline
Evals are among the most critical aspects of AI agent development, regardless of agent type or complexity—everyone should learn to build, interpret, and use them.
People wrongly assume evals provide objective truth, but near-identical scores often mask significant real-world performance differences.
For example, Sonnet 4.6's score of 52 looks precise next to the scores of nearby models, yet half an hour of real use shows that such scores reflect actual capability poorly.
Many labs optimize solely for eval scores, so their models drift away from what users need, as shown by Meta's benchmark-max model that disappointed users.
Transform evals from static metrics into dynamic feedback by building, interpreting, and embedding them directly into agent flows for closed-loop improvement.
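The build, interpret, and embed steps above can be sketched as a small loop. This is only an illustration under stated assumptions, not the speaker's implementation: `EvalCase`, `run_evals`, `closed_loop`, and the toy agent and `toy_revise` helper are all hypothetical names, and a real agent would call a model rather than echo its prompt.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Agent = Callable[[str], str]


@dataclass
class EvalCase:
    """Build: one hand-written case with a task-specific pass/fail check."""
    prompt: str
    check: Callable[[str], bool]  # True when the agent's output is acceptable


def run_evals(agent: Agent, cases: List[EvalCase]) -> List[EvalCase]:
    """Interpret: return the failing cases themselves, not just a score."""
    return [case for case in cases if not case.check(agent(case.prompt))]


def closed_loop(
    agent: Agent,
    cases: List[EvalCase],
    revise: Callable[[Agent, List[EvalCase]], Agent],
    max_rounds: int = 3,
) -> Tuple[Agent, List[EvalCase]]:
    """Use: feed failures back into the agent until none remain or rounds run out."""
    failures = run_evals(agent, cases)
    for _ in range(max_rounds):
        if not failures:
            break
        agent = revise(agent, failures)
        failures = run_evals(agent, cases)
    return agent, failures


# Toy stand-ins (assumptions for illustration only).
def toy_agent(prompt: str) -> str:
    return prompt.strip()


def toy_revise(agent: Agent, failures: List[EvalCase]) -> Agent:
    # Hypothetical "fix": wrap the agent so every answer ends with a period.
    def revised(prompt: str) -> str:
        out = agent(prompt)
        return out if out.endswith(".") else out + "."
    return revised


cases = [EvalCase("hello", lambda out: out.endswith("."))]
improved, remaining = closed_loop(toy_agent, cases, toy_revise)
print(f"failures left after revision: {len(remaining)}")
```

Returning the failing cases rather than a single number is a deliberate choice in this sketch: the failures are what the revision step consumes, which is what makes the loop closed.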
Mindmap
- AI Evals Are Broken — But Still Useful
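  - Root causes
    - False precision: similar scores ≠ equal capability
    - Evals have degenerated into a score-chasing game
  - The right way to use evals
    - Build evals
    - Interpret evals
    - Use evals: embed them in the agent flow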
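One possible reading of the "false precision" branch above is statistical, although the summary does not say the talk framed it this way. The sketch below assumes a benchmark of 500 pass/fail items (a made-up size, since the source gives none) and compares the score of 52 mentioned in the outline with a hypothetical neighbor at 50. The function names are illustrative, and the normal approximation is only a rough guide.

```python
import math
from typing import Tuple

Interval = Tuple[float, float]


def score_interval(score_pct: float, n_items: int, z: float = 1.96) -> Interval:
    """Approximate 95% interval (normal approximation) for a pass rate, in points."""
    p = score_pct / 100
    half_width = z * math.sqrt(p * (1 - p) / n_items) * 100
    return score_pct - half_width, score_pct + half_width


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


N_ITEMS = 500  # assumption: the source does not state any benchmark size

model_a = score_interval(52, N_ITEMS)  # the score of 52 cited in the outline
model_b = score_interval(50, N_ITEMS)  # hypothetical nearby model

print(f"model A: {model_a[0]:.1f} to {model_a[1]:.1f}")
print(f"model B: {model_b[0]:.1f} to {model_b[1]:.1f}")
print("intervals overlap:", intervals_overlap(model_a, model_b))
```

Under these assumptions each score carries an uncertainty of roughly ±4.4 points, so a two-point gap on the leaderboard says little on its own. Overlapping intervals are not a formal significance test, only a quick sanity check.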
Highlights
If you spent like half an hour using any of these models, you'll know very quickly that these scores don't necessarily mean much.
Meta came out with a new model... it was a huge disappointment because it was benchmark max.
Most people know a lot of things about evals... but they're wrong about evals.