Most people training agentic LLMs with reinforcement learning (RL) right now have a silently broken training loop and have no idea.

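TL;DR · AI summary
Most people training agentic LLMs with reinforcement learning (RL) right now have a silently broken training loop and have no idea. Single-turn RL works very well, but once tools are added so the model can act mid-rollout, things get complicated: the loss spikes for no apparent reason, and eventually a shape-mismatch error appears. The cause is that each cycle of parsing the model's output, detecting a tool call, and re-tokenizing the updated conversation can return tokens that differ from the ones the model actually sampled. The fix is one rule: never re-encode tokens you have already decoded. Keep the sampled tokens in a single buffer and never re-render them, and both failure modes disappear.
Key points
- Single-turn RL works well, but once tools are added, token handling needs care to avoid loss spikes and shape-mismatch errors.
- Parsing and re-tokenizing the conversation can place the gradient on a sequence the model never sampled, which silently corrupts training.
- The fix is the Token-In, Token-Out approach: keep the sampled tokens unchanged.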
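To make the re-tokenization hazard concrete, here is a toy sketch. The four-token vocabulary and the greedy longest-match encoder are hypothetical stand-ins, not any real tokenizer's API; real tokenizers differ in detail, but they share the property that one string can correspond to more than one token sequence.

```python
# Toy illustration only: VOCAB and the greedy encoder below are hypothetical,
# chosen so that one string has two valid tokenizations.
VOCAB = ["a", "b", "ab", " "]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def decode(ids):
    """Concatenate token strings back into text."""
    return "".join(VOCAB[i] for i in ids)


def encode(text):
    """Greedy longest-match tokenization over VOCAB."""
    ids, pos = [], 0
    by_length = sorted(VOCAB, key=len, reverse=True)
    while pos < len(text):
        for tok in by_length:
            if text.startswith(tok, pos):
                ids.append(TOKEN_TO_ID[tok])
                pos += len(tok)
                break
        else:
            raise ValueError(f"cannot tokenize at position {pos}")
    return ids


# Tokens the model actually sampled: "a" followed by "b".
sampled = [TOKEN_TO_ID["a"], TOKEN_TO_ID["b"]]

# What a parse-then-re-tokenize loop would train on instead.
round_tripped = encode(decode(sampled))

print("sampled:      ", sampled)
print("round-tripped:", round_tripped)
print("same text:    ", decode(sampled) == decode(round_tripped))
print("same tokens:  ", sampled == round_tripped)
```

The text is identical after the round trip, but the token sequence is not, and its length changes too. In a training loop that would mean log-probabilities computed for tokens the model never sampled, and tensors whose shapes no longer line up with what was recorded at sampling time.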
Outline
- RL problems when training LLMs
  - Single-turn RL works well
    - Clean curves
  - Complexity after adding tools
    - Loss spikes
  - Cause
    - Parsing and re-tokenizing the conversation
  - Fix
    - Token-In, Token-Out approach
Highlights
Single-turn RL works very well, but things get complicated once tools are added.
Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea.

Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get weird. Loss spikes for no reason. Eventually a shape-mismatch error.

The culprit: every time you parse the model's output to detect a tool call, then re-tokenize the updated conversation for the next turn, you're rolling the dice. Usually the round-trip gives back the same tokens. Sometimes it doesn't, and your gradient lands on a sequence the model never actually sampled. No crash. Just quietly wrong math and a useless gradient signal.

The fix is one rule: never re-encode tokens you've decoded. Keep the sampled tokens in one buffer, never re-render them, and both failure modes disappear. That's Token-In, Token-Out done right.

Our team just published a beautiful deep-dive on exactly this, including an audit across the major open-weights model families showing most chat templates already support it. Required reading if you're doing multi-turn RL: qgallouedec-tito.hf.space
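A minimal sketch of the one-buffer rule, under loud assumptions: `Rollout`, `fake_model`, `run_tool`, the `<tool>` markup, and the character-level tokenizer are all hypothetical stand-ins invented for illustration, not the API of any RL library or the method in the linked deep-dive. A real setup would also wrap tool results in the chat template's turn delimiters, which this sketch omits.

```python
from dataclasses import dataclass, field


def encode(text):
    """Stand-in tokenizer: one token id per character."""
    return [ord(c) for c in text]


def decode(ids):
    """Inverse of the stand-in tokenizer."""
    return "".join(chr(i) for i in ids)


@dataclass
class Rollout:
    """Append-only token buffer for one multi-turn rollout (hypothetical helper)."""
    token_ids: list = field(default_factory=list)
    loss_mask: list = field(default_factory=list)  # 1 = sampled by model, 0 = context

    def add_context(self, ids):
        """Append tokens the model did not sample (prompt, tool results)."""
        self.token_ids.extend(ids)
        self.loss_mask.extend([0] * len(ids))

    def add_sampled(self, ids):
        """Append tokens exactly as they were sampled."""
        self.token_ids.extend(ids)
        self.loss_mask.extend([1] * len(ids))


def fake_model(context_ids, turn):
    """Stand-in for sampling: returns token ids, never text."""
    replies = ["<tool>add 2 3</tool>", "The answer is 5."]
    return encode(replies[turn])


def run_tool(call_text):
    """Stand-in tool: parses 'add X Y' and returns the sum as text."""
    _, x, y = call_text.split()
    return str(int(x) + int(y))


rollout = Rollout()
rollout.add_context(encode("What is 2 + 3?"))

for turn in range(2):
    sampled = fake_model(rollout.token_ids, turn)
    rollout.add_sampled(sampled)  # stored as ids; never decoded and re-encoded
    text = decode(sampled)        # decoding a copy for parsing is fine
    if text.startswith("<tool>"):
        call = text[len("<tool>"):-len("</tool>")]
        # Only text that is new to the conversation is ever encoded.
        rollout.add_context(encode(run_tool(call)))

print(decode(rollout.token_ids))
print(sum(rollout.loss_mask), "of", len(rollout.token_ids), "tokens carry loss")
```

The design choice is that decoding is one-way: text is produced only to parse tool calls, and the ids handed to the trainer are the same objects the sampler returned. The loss mask keeps the gradient on sampled tokens only, so prompt and tool-result tokens act purely as context.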