Interpretability Research on Tool-Using Agents

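TL;DR · AI Summary
The paper finds that tool-using agents often recognize that they should call a tool but fail to call one. This mismatch between recognition and execution ranges from 26% to 54% and arises in the cognition-to-action transition.
Key points
- The model recognizes that it should call a tool but does not actually call one; the mismatch rate is 26-54%
- The problem is concentrated in the cognition-to-action transition, not in cognition itself
- The paper tries to predict which interventions will work; prompting and tool-call training usually get the blame, while the late-layer geometry tends to be overlooked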
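To make the 26-54% figure concrete, here is a minimal sketch of how a recognition-versus-execution mismatch rate could be computed. It assumes one plausible reading of the metric: among episodes where a probe on the hidden states says a tool call is needed, the fraction in which the model emitted no tool call. The `Episode` type, its field names, and the toy data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical fields, for illustration only.
    probe_says_call: bool    # probe on hidden states predicts "a tool call is needed"
    model_called_tool: bool  # the model actually emitted a tool call


def mismatch_rate(episodes):
    """Fraction of probe-positive episodes in which no tool call was emitted."""
    recognized = [e for e in episodes if e.probe_says_call]
    if not recognized:
        return 0.0
    missed = sum(1 for e in recognized if not e.model_called_tool)
    return missed / len(recognized)


# Toy data (made up): four probe-positive episodes, two of them without a call.
episodes = [
    Episode(True, True),
    Episode(True, False),
    Episode(True, False),
    Episode(True, True),
    Episode(False, False),
]
rate = mismatch_rate(episodes)
print(f"mismatch rate: {rate:.0%}")
```

Conditioning on probe-positive episodes keeps the metric focused on the cognition-to-action step: cases where the model never recognized the need for a tool are excluded rather than counted as mismatches.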
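Outline
- Interpretability of tool-using agents
  - Core findings
    - Mismatch rate of 26-54%
    - The problem is concentrated in the cognition-to-action transition
  - Toward fixes
    - The paper tries to predict which interventions will work
    - Prompting and tool-call training usually get the blame; the late-layer geometry tends to be overlooked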
Interesting interpretability paper on tool-using agents.

The authors probe hidden states and find that the model often recognizes it should call a tool but fails to actually call one. The mismatch ranges from 26 to 54%, and it concentrates entirely in the cognition-to-action transition, not in cognition itself.

In other words, the model usually knows it should call the tool. The internal probe direction is decodable. But the late-layer last-token regime rotates that signal nearly orthogonal to the action it produces.

This work tries to predict which interventions will actually work and which will not. Most will blame bad prompting or weak tool-call training, and probably ignore the late-layer geometry.

If you have been A/B testing tool-use prompts and getting weird ceilings, this work might offer a good explanation for that behavior.

Paper: arxiv.org/abs/2605.14038

Learn to build effective AI agents in our academy: academy.dair.ai
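The claim that a signal can be decodable yet nearly orthogonal to the action can be illustrated with a toy example. The sketch below uses synthetic vectors, not real model activations. It swaps in a simple difference-of-means probe, which is not necessarily the probe the authors use. The `w_action` vector is a hypothetical stand-in for whatever readout direction drives the tool-call output.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sample(rng, shift, dim):
    # Synthetic "hidden state": the class signal lives on axis 0, noise elsewhere.
    h = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    h[0] += shift
    return h


rng = random.Random(0)
DIM = 8
needs_tool = [sample(rng, +1.0, DIM) for _ in range(200)]
no_tool = [sample(rng, -1.0, DIM) for _ in range(200)]

# Difference-of-means probe direction (assumption: a stand-in for a trained probe).
mu_pos, mu_neg = mean(needs_tool), mean(no_tool)
w_probe = [p - n for p, n in zip(mu_pos, mu_neg)]

# Decodability: classify by projecting onto the probe direction.
threshold = (dot(mu_pos, w_probe) + dot(mu_neg, w_probe)) / 2
correct = sum(dot(h, w_probe) > threshold for h in needs_tool)
correct += sum(dot(h, w_probe) <= threshold for h in no_tool)
probe_accuracy = correct / (len(needs_tool) + len(no_tool))

# Hypothetical action readout direction, placed on a different axis by construction.
w_action = [0.0] * DIM
w_action[1] = 1.0

alignment = cosine(w_probe, w_action)
print(f"probe accuracy: {probe_accuracy:.2f}")
print(f"cosine(probe, action): {alignment:.3f}")
```

In this construction the probe separates the two classes almost perfectly, while the action direction barely overlaps with it, so a readout along `w_action` would see little of the "needs a tool" signal. The orthogonality here is built in by hand; the sketch shows only that the two properties can coexist, not how the paper measures them.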