Meituan LongCat Open-Sources General 365: Establishing a New Benchmark for Reasoning Evaluation
Meituan Technical Team · 2029 words (about 9 minutes)
85
Meituan open-sources the General 365 benchmark, revealing the real capability boundaries of large models in general reasoning.
Reason for selection: Gemini 3 Pro reaches only 62.8% accuracy on General 365, and most models fall short of the passing line.
Featured Article · #Large Model #Reasoning Evaluation #General Reasoning · Chinese
