Thomas Wolf (@Thom_Wolf)

I'm very excited about this extension to the celebrated Terminal-Bench to science. If you're a scie...

TL;DR · AI Summary

Thomas Wolf is excited about the extension of Terminal-Bench to scientific fields, known as Terminal-Bench Science. This benchmark evaluates AI models' ability to control tools via the command line to achieve scientific goals. It's open for contributions of real scientific workflows until August 2026, aiming to improve AI models' assistance in research work.

Key Points

  • Terminal-Bench Science evaluates AI models' performance in handling scientific workflows through command-line tools.
  • The benchmark is open for contributions of real scientific workflows until August 2026.
  • This initiative aims to enhance AI models' capabilities in assisting daily research work across various scientific disciplines.

Outline

  1. Thomas Wolf announces the extension of Terminal-Bench to scientific fields, highlighting its importance for AI in science.

  2. Terminal-Bench Science is a benchmark designed to evaluate AI models' ability to control tools via the command line to achieve scientific goals.

  3. The benchmark is open for scientists to contribute their real-world scientific workflows until August 2026.

  4. The goal is to improve AI models' assistance in daily research work by diversifying and expanding the benchmark with various scientific workflows.

  5. It's important to note that this is an evaluation tool, not a training dataset, aimed at assessing the performance of frontier AI models.

Highlights

Key quotes worth saving and sharing.

  • Terminal-Bench Science is benchmarking AI agents on real scientific workflows and is now open for task contributions.

    Steven Dillmann's tweet

  • @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks.

    Steven Dillmann's tweet

  • The more workflows and the more diverse they are, the better the next generation of AI models will be at helping you in your daily research work.

    Thomas Wolf's tweet

#AI #Science #Terminal-Bench #Benchmarking #Command Line

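I'm very excited about this extension to the celebrated Terminal-Bench to science. If you're a scientist (life sciences, physical sciences, earth sciences, mathematical sciences, etc.) interested in AI, be sure to check this out! Terminal-Bench evaluates how well AI models can control tools on a computer through the command line to achieve a goal. T-Bench Science now extends this to "AI for science" and is calling for contributions of your own real scientific workflows to the benchmark before August 2026. The more workflows and the more diverse they are, the better the next generation of AI models will be at helping you in your daily research work. Note that this is not a training dataset; it is meant to evaluate the performance of frontier models.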

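Quoted post

Steven Dillmann @StevenDillmann

May 20

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows, now open for task contributions! 👇 tbench.ai/news/tb-science @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We are now extending it to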
