
NVIDIA Launches AI Framework Polar, Boosting Codex Score by 594.74%

TL;DR · AI Summary

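NVIDIA has released Polar, an open-source framework that substantially improves the performance and training efficiency of agents such as Codex.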

Key Takeaways

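  • Polar raised Codex's pass@1 score on SWE-Bench Verified by 594.74% in relative terms.
  • By placing a proxy at the model API boundary, Polar avoids rewriting existing harnesses.
  • Polar raises GPU utilization and makes training about 5.39 times faster.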

Outline

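  1. NVIDIA releases Polar, an open-source framework that improves the performance of agents such as Codex.

  2. Polar connects existing harnesses to GRPO training without changing them.

  3. GRPO adjusts the model's policy based on reward signals, improving performance on multi-step decision tasks.

  4. Polar places a model proxy between the harness and the inference server and accepts several request styles.

  5. Polar improves both benchmark results and training efficiency for agents such as Codex.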

Highlights

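  • Polar raised Codex's pass@1 score on SWE-Bench Verified from 3.8% to 26.4% (a relative increase of 594.74%).

    Source: experiments section

  • By placing a proxy at the model API boundary, Polar avoids rewriting existing harnesses.

    Source: Polar's core design

  • Compared with per_request, prefix_merging cut the number of updates over three training steps from 1185 to 218 and wall-clock time from 189.5 minutes to 35.2 minutes, about 5.39 times faster.

    Source: efficiency section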

IT Home, May 28, 2023. NVIDIA's research team released the open-source framework Polar this week. It lets existing agent frameworks such as Codex, Claude Code, and Qwen Code plug into GRPO (Group Relative Policy Optimization) training without disturbing their native tool calls, context organization, or patch submission methods.

IT Home Note: GRPO is a reinforcement learning optimization method that adjusts the model's policy based on reward signals, so that the model learns better actions in multi-step decision tasks.

Here, GRPO is used mainly to train code agents, so that the model keeps improving within real tool-call and patch-submission workflows.
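
The "group relative" part of GRPO can be illustrated with a few lines of Python. GRPO samples several rollouts for the same task and scores each one against the others in its group, rather than against a learned value function. The sketch below shows only that normalization step. The function name, the epsilon value, and the 0/1 rewards are illustrative assumptions, and the article does not state the exact formulation Polar's trainer uses.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout relative to the other rollouts of the same task.

    Hypothetical helper: subtract the group mean and divide by the group
    standard deviation (plus eps to avoid division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task; reward 1.0 means the patch passed, 0.0 means it failed.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every rollout gets the same reward, no rollout is preferred over another.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative advantages, so the policy update favors the actions that led to passing patches.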

The paper notes that agent reinforcement learning is moving from single-step tasks to long-horizon tasks such as code repository modifications, browser operations, and operating system interactions. These tasks often rely on existing execution frameworks that involve multi-turn model calls, tool use, context compression, and sub-agent collaboration.

The difficulty is that these frameworks are hard to rewrite directly into traditional reinforcement learning environment interfaces, and forcing the integration can lose key training signals.

Polar does not rewrite the agent framework. Instead, it places a proxy at the model API boundary and makes minimal changes to the original harness.

A harness here means an agent runtime such as Codex CLI, Claude Code, Qwen Code, or Pi. Traditional reinforcement learning infrastructure typically requires rewriting the harness logic into environment interfaces along the lines of env.init(), env.step(), and env.reset(), which is costly and can lose native execution details.
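
For contrast, a conventional environment-style interface looks roughly like the toy class below. It is a hypothetical example, not code from Polar or from any harness: the class name, the task string, and the reward rule are all assumptions, and the dataclass constructor stands in for env.init(). The point is that everything the agent does has to be funnelled through reset() and step(), which is the rewrite Polar is designed to avoid.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPatchEnv:
    """Hypothetical environment: an episode ends on 'submit' or after max_steps."""

    max_steps: int = 3
    steps: int = field(default=0, init=False)

    def reset(self) -> str:
        # Start a new episode and return the first observation.
        self.steps = 0
        return "task: fix the failing test"

    def step(self, action: str):
        # Apply one action and return (observation, reward, done).
        self.steps += 1
        done = action == "submit" or self.steps >= self.max_steps
        reward = 1.0 if action == "submit" else 0.0
        return f"observation after {action!r}", reward, done


env = ToyPatchEnv()
first_obs = env.reset()
obs, reward, done = env.step("run_tests")
final_obs, final_reward, final_done = env.step("submit")
```

A real harness interleaves tool calls, context compression, and sub-agents in ways that do not map cleanly onto a single step() call, which is where native execution details tend to get lost.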

Polar's core design treats the interface between the agent and the model as the training boundary, rather than converting the execution framework itself into an environment.

It places a model proxy between the execution framework and the inference server. The proxy accepts Anthropic-, OpenAI-, and Google-style requests, records prompts, sampled tokens, log probabilities, and response content while forwarding each request, and then reconstructs these records into trajectories that the trainer can consume.
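
The record-while-forwarding idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Polar's implementation: the class names, field names, trajectory layout, and the stub backend are all hypothetical, and a real deployment would speak HTTP in the vendors' request formats rather than call a Python function.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Completion:
    text: str
    token_ids: list
    logprobs: list


@dataclass
class Step:
    prompt: str
    completion: Completion


@dataclass
class RecordingProxy:
    """Forwards each request to a backend and logs what a trainer would need."""

    backend: Callable[[str], Completion]  # stand-in for the inference server
    sessions: dict = field(default_factory=dict)

    def complete(self, session_id: str, prompt: str) -> Completion:
        completion = self.backend(prompt)  # forward the request unchanged
        self.sessions.setdefault(session_id, []).append(Step(prompt, completion))
        return completion  # the harness sees an ordinary model reply

    def trajectory(self, session_id: str, reward: float) -> dict:
        # Rebuild the recorded calls of one session into a trainer-facing record.
        steps = self.sessions.get(session_id, [])
        return {
            "session_id": session_id,
            "reward": reward,
            "steps": [
                {
                    "prompt": s.prompt,
                    "token_ids": s.completion.token_ids,
                    "logprobs": s.completion.logprobs,
                    "text": s.completion.text,
                }
                for s in steps
            ],
        }


def fake_backend(prompt: str) -> Completion:
    """Deterministic stub; a real server would sample tokens from the model."""
    words = prompt.split()
    return Completion(
        text=f"echo {len(words)}",
        token_ids=list(range(len(words))),
        logprobs=[-0.1] * len(words),
    )


proxy = RecordingProxy(backend=fake_backend)
proxy.complete("s1", "list files")
proxy.complete("s1", "list files then open README")
traj = proxy.trajectory("s1", reward=1.0)
```

Because the harness only ever calls complete(), its tool calls and context handling stay untouched; the training data is a by-product of the traffic it already generates.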

Structurally, Polar consists of a rollout server and a gateway node. The rollout server handles task submission, session scheduling, state persistence, and callback reception. The gateway node handles the full lifecycle of session execution: runtime startup, execution framework preparation, trajectory building, result evaluation, and resource cleanup.

The paper also splits initialization, running, and post-processing into independent worker pools and adds a READY buffer, so that runtime warm-up and evaluation warm-up run in parallel in the background and long-tail tasks block GPU training less.
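
A staged pipeline of this kind can be sketched with the Python standard library. Only the three stages and the READY buffer come from the article; the function names, pool sizes, buffer capacity, and the dummy work inside each stage are assumptions made for illustration.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def initialize(task_id: int) -> dict:
    # Stand-in for runtime startup and harness preparation.
    return {"task": task_id, "state": "READY"}


def run(session: dict) -> dict:
    # Stand-in for the agent rollout itself.
    return {**session, "state": "RAN", "patch": f"patch-{session['task']}"}


def post_process(session: dict) -> dict:
    # Stand-in for trajectory building and evaluation.
    return {**session, "state": "DONE", "reward": float(session["task"] % 2)}


def run_pipeline(task_ids, ready_capacity=2):
    task_ids = list(task_ids)
    ready = queue.Queue(maxsize=ready_capacity)  # bounded READY buffer
    with ThreadPoolExecutor(max_workers=2) as init_pool, \
         ThreadPoolExecutor(max_workers=2) as run_pool, \
         ThreadPoolExecutor(max_workers=2) as post_pool:
        # Initialization workers warm sessions in the background and block
        # only when the READY buffer is full.
        for task_id in task_ids:
            init_pool.submit(lambda t=task_id: ready.put(initialize(t)))
        # The run stage starts each session as soon as one is READY.
        run_futures = [run_pool.submit(run, ready.get()) for _ in task_ids]
        # Post-processing has its own pool, so it does not occupy run workers.
        post_futures = [post_pool.submit(post_process, f.result()) for f in run_futures]
        results = [f.result() for f in post_futures]
    return sorted(results, key=lambda r: r["task"])


results = run_pipeline(range(5))
```

Keeping a small stock of warmed sessions in the buffer means the run stage rarely waits on a slow startup, which is the effect the article attributes to the READY buffer.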

The experiments focus on software engineering tasks. Starting from the same Qwen3.5-4B base model and training with Polar and GRPO (Group Relative Policy Optimization) on four code execution frameworks (Codex, Claude Code, Qwen Code, and Pi), pass@1 on SWE-Bench Verified improved from 3.8% to 26.4% (a relative increase of 594.74%), from 29.8% to 34.6%, from 34.6% to 35.2%, and from 34.2% to 40.4%, respectively.
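
The headline 594.74% is a relative increase, not a gain in percentage points. The short calculation below reproduces it from the reported before and after scores and applies the same formula to the other three harnesses. The scores come from the article; the variable and function names are illustrative.

```python
# pass@1 on SWE-Bench Verified, in percent: (before training, after training)
scores = {
    "Codex": (3.8, 26.4),
    "Claude Code": (29.8, 34.6),
    "Qwen Code": (34.6, 35.2),
    "Pi": (34.2, 40.4),
}


def relative_gain_pct(before: float, after: float) -> float:
    """Relative increase over the baseline, in percent."""
    return (after - before) / before * 100


gains = {name: round(relative_gain_pct(b, a), 2) for name, (b, a) in scores.items()}
points = {name: round(a - b, 1) for name, (b, a) in scores.items()}

for name in scores:
    print(f"{name}: +{points[name]} points, +{gains[name]}% relative")
```

Codex gains 22.6 percentage points, which is 594.74% of its low 3.8% baseline. By the same formula the relative gains for Claude Code, Qwen Code, and Pi are about 16.11%, 1.73%, and 18.13%, because their baselines are much higher.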

On efficiency, compared with per_request, prefix_merging reduces the number of updates over three training steps from 1185 to 218 and wall-clock time from 189.5 minutes to 35.2 minutes, about 5.39 times faster. Average rollout GPU utilization also rose from 20.4% to 87.7%.
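
The efficiency ratios can be checked directly from the reported figures. The numbers below come from the article; the variable names are illustrative. The rounded times give a speedup of about 5.38, slightly below the reported 5.39, a gap that presumably comes from rounding in the published times.

```python
# Reported figures: (before, after)
updates = (1185, 218)
wall_clock_minutes = (189.5, 35.2)
rollout_gpu_utilization_pct = (20.4, 87.7)

update_reduction = round(updates[0] / updates[1], 2)
speedup = round(wall_clock_minutes[0] / wall_clock_minutes[1], 2)
utilization_ratio = round(
    rollout_gpu_utilization_pct[1] / rollout_gpu_utilization_pct[0], 2
)

print(f"updates reduced {update_reduction}x")
print(f"wall-clock speedup {speedup}x")
print(f"rollout GPU utilization up {utilization_ratio}x")
```

The update count falls by a similar factor to the wall-clock time (about 5.44 versus about 5.38), which is consistent with the number of updates being the main driver of training time in this comparison.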
