AI & ML Paradigm Shift

Identifies 'diversity collapse' in the popular GRPO reinforcement learning method and introduces MUPO to maintain broad reasoning paths.

April 2, 2026

Original Paper

All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Peter Tu, Jing Zhang

arXiv · 2604.00479

The Takeaway

As more researchers adopt DeepSeek-style GRPO to train reasoning models, this paper highlights a critical failure mode, 'diversity collapse', in which models converge prematurely on a narrow set of solution strategies. MUPO enables more robust reasoning scaling by incentivizing divergent thinking during RL training.
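To see why diversity can collapse, it helps to recall GRPO's group-relative advantage: each sampled response is scored only against the mean and standard deviation of rewards within its own group. The sketch below is illustrative (function name and zero-variance handling are my own, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO:
    A_i = (r_i - mean(r)) / std(r) over one sampled group."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std == 0:
        # Identical rewards across the group -> zero advantage,
        # i.e. no learning signal at all.
        return np.zeros_like(r)
    return (r - r.mean()) / std

# A group where one dominant strategy already wins most rollouts:
print(grpo_advantages([1.0, 1.0, 0.0, 1.0]))
```

Once most rollouts in a group follow the same successful strategy, the within-group variance shrinks and the advantages flatten toward zero, so the policy keeps reinforcing that single path rather than exploring alternatives, which is the collapse dynamic the paper targets.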

From the abstract

Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite the promise, the underlying mechanisms that drive the effectiveness of RL models, as well as their limitations, remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models, where the former engages in deeper yet narrower reasoning. […]