AI & ML New Capability

Transitions reasoning model optimization from coarse sequence-level advantages to fine-grained token dynamics.

March 31, 2026

Original Paper

ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models

Song Yu, Li Li

arXiv · 2603.28204

The Takeaway

Standard GRPO (used in models like DeepSeek-R1) treats all tokens in a sequence equally, leading to entropy collapse and redundant reasoning. ERPO identifies 'Critical Decision Pivots' to amplify exploration exactly where logic forks, significantly improving performance on benchmarks like AIME and MATH.

From the abstract

Reinforcement learning from verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models. However, standard Group Relative Policy Optimization (GRPO) typically assigns a uniform, sequence-level advantage to all tokens, thereby overlooking the intrinsic information heterogeneity along reasoning chains. We show that this coarse-grained credit assignment leads to premature entropy collapse and encourages the model to generate redundant, low-quality reasoning […]
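The core contrast can be sketched in a few lines: GRPO copies one sequence-level advantage to every token, while an entropy-regulated scheme reweights credit at high-entropy tokens (stand-ins here for the paper's "Critical Decision Pivots"). This is a minimal illustration, not the paper's actual formulation; the `threshold` and `boost` parameters are hypothetical.

```python
import numpy as np

def token_entropy(logits):
    """Per-token entropy of the policy's next-token distribution."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def grpo_advantages(seq_advantage, seq_len):
    """Standard GRPO: one sequence-level advantage, copied to every token."""
    return np.full(seq_len, seq_advantage)

def erpo_style_advantages(seq_advantage, logits, threshold=1.0, boost=2.0):
    """Entropy-regulated sketch: amplify credit where the policy is
    uncertain (high entropy), i.e. where the reasoning chain forks."""
    ent = token_entropy(logits)                      # shape: (seq_len,)
    weights = np.where(ent > threshold, boost, 1.0)  # boost uncertain tokens
    return seq_advantage * weights

# A confident token (peaked logits) keeps the base advantage;
# an uncertain token (flat logits) gets amplified credit.
logits = np.array([[10.0, 0.0, 0.0, 0.0],   # near-deterministic choice
                   [ 0.0, 0.0, 0.0, 0.0]])  # uniform over 4 options
print(grpo_advantages(0.5, 2))               # → [0.5 0.5]
print(erpo_style_advantages(0.5, logits))    # → [0.5 1. ]
```

The uniform row has entropy ln 4 ≈ 1.39, crossing the (illustrative) threshold, so its advantage is doubled; the peaked row stays at the sequence-level value, mimicking the uniform GRPO assignment.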