AI & ML · Breaks Assumption

Identifies and solves the 'information self-locking' failure mode where RL-trained agents stop asking informative questions in active reasoning tasks.

arXiv · March 13, 2026 · 2603.12109

Deyu Zou, Yongqiang Chen, Fan Feng, Mufei Li, Pan Li, Yu Gong, James Cheng

Why it matters

Reveals that standard outcome-based rewards can trap agents in a self-reinforcing loop of exploration failure. The proposed directional-critique method offers a blueprint for training more 'curious' and effective reasoning agents.

From the abstract

Reinforcement learning (RL) with outcome-based rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning, where agents need to strategically ask questions to acquire task-relevant information, we find that LLM agents trained with RL often suffer from information self-locking: the agent ceases to ask informative questions and struggles to internalize already-obtained information. […]