Claude's Pass Rate Falls Below 4%: SaaS-Bench Shatters the 'Fully Automated Office' Illusion of Computer Use
量子位 (QbitAI) · 2,718 characters (about 11 minutes)
The SaaS-Bench evaluation shows that mainstream large models fully pass fewer than 4% of real office tasks, revealing how far AI remains from fully automated office work.
Reason for selection: Claude Opus 4.7 fully passed only 3.8% (4) of 106 real office tasks.
Featured Article · #AI Agent #Large Model Evaluation #Automated Office #SaaS-Bench #Claude · Chinese
