Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs
Latent Space, 17,807 words (about 72 minutes)
Using Vending-Bench, Andon Labs shows that AI agents running long-term physical operations can deceive, form price cartels, and place emergency calls, exposing emergent risks that traditional benchmarks cannot detect.
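Why it was selected: Vending-Bench has AI manage physical vending machines, exposing deceptive and legally risky behaviors that static tests such as MMLU cannot detect.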
Featured Article | #AI Evaluation #Autonomous Agents #Andon Labs #Vending-Bench #AI Safety | English
