Most people training agentic LLMs with reinforcement learning (RL) right now have a silently broken training loop and have no idea.

clem 🤗 (@ClementDelangue) · May 29, 2026

TL;DR · AI summary

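Most people training agentic LLMs with reinforcement learning (RL) right now have a silently broken training loop and have no idea. Single-turn RL works well, but once tools are added so the model can act mid-rollout, things get weird: the loss spikes for no apparent reason, and eventually a shape-mismatch error appears. The cause is the round trip: each time the model's output is parsed to detect a tool call and the updated conversation is re-tokenized for the next turn, the resulting tokens may differ from the ones the model actually sampled. The fix is a single rule: never re-encode tokens you have already decoded. Keep the sampled tokens in one buffer, never re-render them, and both failure modes disappear.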

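Key points

Single-turn RL works well; adding tools mid-rollout brings unexplained loss spikes and, eventually, shape-mismatch errors.
Parsing and re-tokenizing the conversation can place the gradient on a sequence the model never sampled, which silently corrupts training.
The fix is Token-In, Token-Out: keep the sampled tokens unchanged in one buffer and never re-encode them.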
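To make the failure concrete, here is a minimal sketch of a decode-then-encode round trip that returns different tokens. The vocabulary, the greedy longest-match `encode`, and the sampled ids are all hypothetical and exist only for illustration; real tokenizers differ in detail, but the same string can likewise have more than one valid tokenization.

```python
# Toy vocabulary (hypothetical): "<tool>ab" can be spelled with 5 tokens or with 2.
VOCAB = {"<": 0, "tool": 1, ">": 2, "<tool>": 3, "a": 4, "b": 5, "ab": 6}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def decode(ids):
    """Concatenate the text of each token id."""
    return "".join(ID_TO_TEXT[i] for i in ids)


def encode(text):
    """Greedy longest-match encoding, a stand-in for a real tokenizer."""
    ids, pos = [], 0
    while pos < len(text):
        match = max(
            (t for t in VOCAB if text.startswith(t, pos)), key=len, default=None
        )
        if match is None:
            raise ValueError(f"no token covers position {pos}")
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


# Tokens the policy actually sampled, with one log-probability per token.
sampled = [0, 1, 2, 4, 5]
sampled_logprobs = [-0.2, -0.1, -0.3, -0.9, -0.4]

# Round trip: decode to text (to parse a tool call), then re-tokenize.
text = decode(sampled)
reencoded = encode(text)

# The text is identical, but the token sequence is not.
same_tokens = reencoded == sampled
# The lengths now disagree too, which is where a shape mismatch would surface.
same_length = len(reencoded) == len(sampled_logprobs)
```

In this toy case the mismatch also changes the sequence length, so it would eventually raise an error. When the lengths happen to match, nothing crashes and the gradient is simply computed on tokens the model never produced.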

Mind map

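RL problems when training LLMs
- Single-turn RL works well
  - Clean curves
- Complexity after adding tools
  - Loss spikes
- Cause
  - Parsing and re-tokenizing the conversation
- Fix
  - Token-In, Token-Out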

#ReinforcementLearning #LLM

Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea.

Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get weird. Loss spikes for no reason. Eventually a shape-mismatch error.

The culprit: every time you parse the model's output to detect a tool call, then re-tokenize the updated conversation for the next turn, you're rolling the dice. Usually the round-trip gives back the same tokens. Sometimes it doesn't and your gradient lands on a sequence the model never actually sampled. No crash. Just quietly wrong math and a useless gradient signal.

The fix is one rule: never re-encode tokens you've decoded. Keep the sampled tokens in one buffer, never re-render them, and both failure modes disappear. That's Token-In, Token-Out done right.

Our team just published a beautiful deep-dive on exactly this, including an audit across the major open-weights model families showing most chat templates already support it. Required reading if you're doing multi-turn RL 🤗🔥

qgallouedec-tito.hf.space
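The one-buffer rule can be sketched roughly as follows. This is an illustrative toy, not the implementation from the linked deep-dive: `RolloutBuffer`, its methods, the four-token vocabulary, and the greedy `encode` are all hypothetical names, and chat-template handling is left out. Sampled ids are appended verbatim, and only new external text (prompt, tool output) is ever tokenized.

```python
from dataclasses import dataclass, field

# Toy vocabulary (hypothetical): "ab" can be one token or two.
VOCAB = {"a": 0, "b": 1, "ab": 2, "!": 3}


def encode(text: str) -> list[int]:
    """Greedy longest-match encoding, a stand-in for a real tokenizer."""
    ids, pos = [], 0
    while pos < len(text):
        match = max(
            (t for t in VOCAB if text.startswith(t, pos)), key=len, default=None
        )
        if match is None:
            raise ValueError(f"no token covers position {pos}")
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


@dataclass
class RolloutBuffer:
    """Single source of truth for the token sequence of one rollout."""

    token_ids: list[int] = field(default_factory=list)
    loss_mask: list[int] = field(default_factory=list)  # 1 = sampled by the policy
    logprobs: list[float] = field(default_factory=list)

    def append_external(self, text: str) -> None:
        """Tokenize only NEW text (prompt or tool output); excluded from the loss."""
        ids = encode(text)
        self.token_ids.extend(ids)
        self.loss_mask.extend([0] * len(ids))
        self.logprobs.extend([0.0] * len(ids))

    def append_sampled(self, ids: list[int], logprobs: list[float]) -> None:
        """Store sampled ids exactly as generated; never decode and re-encode them."""
        if len(ids) != len(logprobs):
            raise ValueError("one log-probability per sampled token is required")
        self.token_ids.extend(ids)
        self.loss_mask.extend([1] * len(ids))
        self.logprobs.extend(logprobs)


buf = RolloutBuffer()
buf.append_external("ab")                 # prompt, tokenized once
buf.append_sampled([0, 1], [-0.5, -0.7])  # model sampled "a", "b" as two tokens
buf.append_external("!")                  # tool output, tokenized on its own

# For contrast: re-tokenizing the whole rendered conversation merges "a" + "b".
naive = encode("abab!")
```

Because the buffer never re-renders earlier turns, the ids, the loss mask, and the log-probabilities stay aligned position by position, while the naive re-tokenization yields a shorter, different sequence.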