AI & ML · Breaks Assumption

Proves that policy gradient algorithms naturally collapse entropy and provides a mathematical fix to preserve exploration and diversity.

arXiv · March 13, 2026 · 2603.11682

Aleksei Petrenko, Ben Lipkin, Kevin Chen, Erik Wijmans, Marco Cusumano-Towner, Raja Giryes, Philipp Krähenbühl

Why it matters

The paper identifies a fundamental flaw in the training dynamics of popular RL algorithms (including those used for LLM reasoning): entropy reduction during training progressively limits the policy's ability to explore, and thus its future trainability. Entropy-control mechanisms such as REPO and ADAPO counteract this collapse, preventing training stagnation and preserving solution diversity in complex environments.
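The collapse dynamic described above can be seen even in a toy setting. The sketch below is not the paper's REPO or ADAPO (those mechanisms are only named here, not specified); it uses a generic entropy bonus on a REINFORCE-style softmax policy over a 3-armed bandit, with toy reward values chosen for illustration. Without the bonus, the policy concentrates on one arm and its entropy decays; with the bonus, exploration is preserved.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(p):
    return -sum(pi * math.log(pi + 1e-12) for pi in p)

def train(beta, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a toy 3-armed bandit with an entropy bonus of weight beta.

    beta=0 reproduces the vanilla policy gradient, which collapses entropy;
    beta>0 adds a generic entropy-maximizing term (an assumption of this
    sketch, not the paper's specific mechanism).
    """
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    rewards = [1.0, 0.5, 0.2]  # toy expected reward per arm
    for _ in range(steps):
        p = softmax(logits)
        a = rng.choices(range(3), weights=p)[0]
        r = rewards[a]
        h = entropy(p)
        for k in range(3):
            # gradient of log pi(a) w.r.t. logits: indicator(k == a) - p_k
            g_logp = (1.0 if k == a else 0.0) - p[k]
            # gradient of entropy H w.r.t. logits for a softmax:
            # dH/dz_k = -p_k * (log p_k + H)
            g_ent = -p[k] * (math.log(p[k] + 1e-12) + h)
            logits[k] += lr * (r * g_logp + beta * g_ent)
    return softmax(logits)

p_plain = train(beta=0.0)   # entropy collapses toward the best arm
p_bonus = train(beta=0.5)   # entropy bonus keeps the policy spread out
```

The entropy-gradient term is what an entropy-control mechanism must supply in some form: a force that balances the reward gradient's pull toward a deterministic policy, so the stationary point retains nonzero exploration.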

From the abstract

Policy gradient algorithms have driven many recent advancements in language model reasoning. An appealing property is their ability to learn from exploration on their own trajectories, a process crucial for fostering diverse and creative solutions. As we show in this paper, many policy gradient algorithms naturally reduce the entropy -- and thus the diversity of explored trajectories -- as part of training, yielding a policy increasingly limited in its ability to explore. In this paper, we argue …