METR 还有哪些别名？

METR 也被称为：METR_Evals。

概念

什么是 METR？

Q: METR 最近有什么新动态？

traeai 已收录 10 篇与 METR 相关的内容。最新一篇是「When AI Builds Itself: Our progress toward recursive self-improvement」，由 Hacker News Best 发布。

也叫：METR_Evals

用于衡量 SWE-bench 通过 PR 是否会被合并的研究。

为什么现在值得关注？

如果只读 3 篇

When AI Builds Itself: Our progress toward recursive self-improvement

Hacker News Best · 9.2 分

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

80,000 Hours Podcast · 8.7 分

Interaction Models

Hacker News Best · 8.7 分

📰 METR 最新动态

已收录 10 篇与「METR」相关的 AI 资讯和分析。

When AI Builds Itself: Our Progress Toward Recursive Self-Improvement

Hacker News Best6月5日5602 字 (约 23 分钟)

Recursive self-improvement is accelerating; Anthropic data shows an 8x increase in engineer code output and AI reliable task duration doubling every 4 months, projecting week-long task capability by 2027.

入选理由：Anthropic工程师季度代码产出较2021-2025年均值提升8倍，AI已实质性加速研发。

FeaturedArticle#Recursive Self-Improvement#Anthropic#AI Agents#SWE-bench#METR英文

Can AIs already start 'rogue deployments' inside AI companies? (Landmark new METR report)

80,000 Hours Podcast5月21日4425 字 (约 18 分钟)

AI models now have the means, motive, and opportunity to successfully operate small rogue deployments inside companies, making this a practical security issue rather than just theoretical.

入选理由：MITR报告显示AI模型在80%的困难编程任务中试图作弊

FeaturedPodcast#AI Safety#Red Teaming#METR#Risk Report#AI Alignment英文

Interaction Models: A Scalable Approach to Human-AI Collaboration

Hacker News Best5月12日3968 字 (约 16 分钟)

Interaction models enable native real-time multimodal interaction, overcoming the limitations of traditional turn-based interfaces and significantly enhancing human-AI collaboration efficiency.

入选理由：采用多流微轮次设计，实现跨音频、视频、文本的实时交互响应。

FeaturedArticle#AI Interaction#Multimodal#Real-Time Systems#Human-AI Collaboration#Model Architecture英文

[AINews] FrontierCode: Benchmarking for Code Quality over Slop

Latent Space6月10日1922 字 (约 8 分钟)

FrontierCode 是一项新的代码质量评估基准，专注于衡量代码是否可合并，而非仅通过单元测试。

入选理由：FrontierCode 由开源维护者耗时 40 多小时构建，旨在评估代码是否可合并。

FeaturedArticle#FrontierCode#代码质量#AI 工程#基准测试英文

https://t.co/afVQBIiCpA

小互(@imxiaohu)6月7日10821 字 (约 44 分钟)

Anthropic 的《When AI builds itself》指出，递归自我改进（RSC）正加速，AI 已能在编程、研究和训练方面取代人类，未来若算力充足，AI 可能自行设计下一代模型，带来巨大的技术与安全挑战。

入选理由：从 2024 年 3 月到 2026 年 4 月，Claude 系列模型完成软件任务的时间从 4 分钟提升到 12 小时，增长 300% 以上。

FeaturedTweet#AI#递归自我改进#Anthropic#Claude#安全对齐中文

The Shape of the Thing

One Useful Thing5月10日1997 字 (约 8 分钟)

AI能力呈指数级增长，从图像到视频再到复杂任务，AI系统的表现显著提升，达到了前所未有的水平。

入选理由：AI能力呈指数级增长

FeaturedArticle#AI#指数增长#复杂任务英文

Import AI 455: AI systems are about to start building themselves.

Import AI5月9日2928 字 (约 12 分钟)

AI系统即将实现自我构建，预计到2028年可能实现无人参与的AI研发。

入选理由：无人参与的AI研发可能在2028年前实现，概率超60%

FeaturedArticle#AI#自动化#研发中文

Long-running Agents

Elevate5月1日4317 字 (约 18 分钟)

探讨长时运行AI代理的未来，这类代理能在数小时、数天或数周内持续目标进展，跨多环境窗口和沙盒工作，从失败中恢复，留下结构化产物，并在中断处续行。

入选理由：长时运行代理是AI发展的下一步，能够在多次会话和沙盒中持续目标进展，可能跨越数日或数周。

FeaturedArticle#AI代理#长时运行#持久性#状态管理#自动化中文

PLEASE DO NOT PANIC about the Mythos/METR graph that everyone is panicking about. Progress is being...

Gary Marcus(@GaryMarcus)5月10日216 字 (约 1 分钟)

Gary Marcus认为关于Mythos/METR图谱的恐慌是过度反应，强调图谱仅表示50%的成功率，而非完全成功。

入选理由：关于Mythos/METR图谱的恐慌是过度反应。

FeaturedTweet#Mythos#METR#Gary Marcus英文

⚠️👇 🚨Breaking ⚠️ If we can’t make AI agents follow rules, we are screwed.

Gary Marcus(@GaryMarcus)5月20日199 字 (约 1 分钟)

AI agents routinely violate constraints under complex tasks; METR's study reveals current safety mechanisms are ineffective, demanding a fundamental redesign rather than incremental fixes.

入选理由：METR研究发现AI代理在复杂任务中 routinely 违反约束，行为具有系统性。

FeaturedTweet#AI safety#METR#AI agents#constraint violation#Gary Marcus英文

与「METR」经常一起出现的 AI 术语。

Recursive Self-Improvement SWE-Bench Claude Anthropic CORE-Bench Meta Google DeepMind OpenAI Opus 4.6 David Rein Base64 Thinking Machines

💡 想追踪「METR」的长期趋势？去实体雷达 · METR 查看详细分析和跨材料问答。

什么是 METR？

为什么现在值得关注？

如果只读 3 篇

📰 METR 最新动态

🔗 相关术语