AI Dev 26 x SF | Ara Khan: Evals Are Broken — Use Them Anyway
AI evals are fundamentally broken—over-reliance on objective metrics misleads—but they remain critical when built, interpreted, and embedded properly in agent workflows.
入选理由:当前主流 eval(如 Epoch AI、OpenAI 的 benchmark)存在‘虚假精确性’,模型分数相近时实际能力差异显著。

