Identifies 'diversity collapse' in the popular GRPO reinforcement learning method and introduces MUPO to maintain broad reasoning paths.
April 2, 2026
Original Paper
All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models
arXiv · 2604.00479
The Takeaway
As more researchers adopt DeepSeek-style GRPO for reasoning models, this paper highlights a critical failure mode, "diversity collapse", in which models converge prematurely to a narrow set of reasoning strategies. MUPO enables more robust reasoning scaling by incentivizing divergent thinking during RL training.
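To see why GRPO is prone to this collapse, recall that it replaces a learned value function with group-relative advantages: each sampled response is scored against the mean and standard deviation of rewards within its own group. The sketch below (a minimal illustration, not the paper's MUPO method) shows that once every sampled response follows the same strategy and earns the same reward, all advantages vanish and the update no longer distinguishes, or preserves, alternative reasoning paths.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each group's
    rewards by the group's own mean and (population) std deviation,
    with a small epsilon to avoid division by zero."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Diverse group: correct responses are pushed up, wrong ones down,
# so the policy gradient carries a useful learning signal.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))   # ~[1, -1, 1, -1]

# Collapsed group: every response uses the same strategy and gets the
# same reward, so all advantages are exactly 0 and the gradient signal
# disappears -- nothing pushes the model back toward diverse strategies.
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))   # [0, 0, 0, 0]
```

Note that collapse is self-reinforcing under this objective: as sampling concentrates on one strategy, within-group reward variance shrinks, which is precisely why a diversity-preserving mechanism like MUPO has to be added on top of the vanilla GRPO update.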
From the abstract
Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite the promise, the underlying mechanisms that drive the effectiveness of RL models as well as their limitations remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models, where the former engages in deeper yet nar