Identifies and solves the 'information self-locking' failure mode, in which RL-trained agents stop asking informative questions in active reasoning tasks.
arXiv · March 13, 2026 · 2603.12109
Why it matters
Reveals that standard outcome-based rewards create a feedback loop that suppresses exploration. The proposed directional critique method offers a blueprint for training more 'curious' and effective reasoning agents.
From the abstract
Reinforcement learning (RL) with outcome-based rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning, where agents need to strategically ask questions to acquire task-relevant information, we find that LLM agents trained with RL often suffer from information self-locking: the agent ceases to ask informative questions and struggles to internalize already-obtained information. …