Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea. Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get weird. Loss spikes for no reason. Eventually a shape-mismatch error. The culprit: every time you parse the model's output to detect a tool call, then re-tokenize the updated conversation for the next turn, you're rolling the
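Why it was selected: single-turn RL works well, but once tools are added the rollout needs careful handling to avoid shape-mismatch errors.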
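
To make the re-tokenization problem concrete, here is a minimal sketch using a toy greedy tokenizer. The vocabulary, token ids, and log-prob values are made up for illustration and do not come from the post or from any real tokenizer. It shows how decoding sampled tokens to text and encoding that text again can yield a different token sequence, so the per-token log-probs recorded during sampling no longer line up with the tokens the trainer sees. The last lines show one possible mitigation, which is an assumption rather than something the post states: keep the sampled ids as they are and tokenize only the newly appended tool output.

```python
# Toy vocabulary (hypothetical). Real tokenizers are far larger, but they share
# the property that one string can have several valid tokenizations.
VOCAB = {"<": 0, "tool": 1, ">": 2, "to": 3, "ol": 4, "<tool>": 5, " ": 6, "ok": 7}
ID_TO_TEXT = {i: s for s, i in VOCAB.items()}


def decode(ids):
    """Concatenate the text of each token id."""
    return "".join(ID_TO_TEXT[i] for i in ids)


def encode(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


# What the policy actually sampled during the rollout, one log-prob per token.
sampled_ids = [0, 3, 4, 2]            # "<", "to", "ol", ">"
sampled_logprobs = [-0.1, -0.7, -0.2, -0.05]  # made-up values

# Parsing for a tool call means going through text.
text = decode(sampled_ids)            # "<tool>"

# Re-tokenizing the whole conversation picks the single merged token instead.
drifted_context = encode(text + " ok")

# Assumed mitigation: never re-encode what the model sampled; tokenize only
# the new tool output and append it to the original ids.
tool_result_ids = encode(" ok")
stable_context = sampled_ids + tool_result_ids

print("sampled:", sampled_ids)
print("drifted:", drifted_context)
print("stable: ", stable_context)
```

Both contexts decode to the same string, yet only the stable one still starts with the four tokens the log-probs were computed for. In the drifted version the model's turn has collapsed to a single token, which is the kind of length disagreement that can surface later as a shape mismatch.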






