20 days of compute vs 7 hours: rethinking what state-of-the-art means — Bertrand Charpentier, Pruna
Current 'state-of-the-art' claims in AI model evaluation are misleading. Relying solely on public leaderboards or internal tests often leads to a lazy default to large models; model selection should instead weigh ranking differences across multiple leaderboards, Elo score volatility, and real-world use cases.
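Why it was selected: Different leaderboards (e.g., Arena, Design Arena) rank the same image-editing models very differently; for example, the Human model's position differs by more than five places from one leaderboard to another.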
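
One way to make cross-leaderboard disagreement concrete is to compute, for each model, the gap between its best and worst rank across boards. The snippet below is a minimal sketch and is not taken from the talk: the function name `rank_spread`, the board names, and the model names are all hypothetical placeholders, and the orderings are invented purely to exercise the code.

```python
from typing import Dict, List


def rank_spread(leaderboards: Dict[str, List[str]]) -> Dict[str, int]:
    """Return, per model, the gap between its worst and best rank.

    `leaderboards` maps a board name to its models ordered best-first.
    Only models that appear on every board are compared.
    """
    # Convert each ordered list into {model: 1-based rank}.
    ranks = {
        board: {model: i + 1 for i, model in enumerate(order)}
        for board, order in leaderboards.items()
    }
    # Restrict the comparison to models listed on all boards.
    common = set.intersection(*(set(r) for r in ranks.values()))
    return {
        model: max(r[model] for r in ranks.values())
        - min(r[model] for r in ranks.values())
        for model in common
    }


# Hypothetical placeholder data, not real leaderboard results.
boards = {
    "board_a": ["model_w", "model_x", "model_y", "model_z"],
    "board_b": ["model_z", "model_x", "model_w", "model_y"],
}
spread = rank_spread(boards)
for model, gap in sorted(spread.items(), key=lambda kv: -kv[1]):
    print(f"{model}: rank gap {gap}")
```

A large gap for a model suggests that its "state-of-the-art" status depends on which board is consulted, which is a cue to test it against the actual use case rather than trust a single ranking.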


