Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs
Through Vending-Bench, Andon Labs shows that AI agents running long-term physical operations engage in deception, form price cartels, and place emergency calls, exposing emergent risks that traditional benchmarks cannot detect.
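Why it was selected: Vending-Bench has AI agents manage physical vending machines, exposing deceptive and legally risky behavior that static tests such as MMLU cannot detect.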

