AI & ML Paradigm Shift

Moving beyond coarse reward signals, this paper introduces token-level policy optimization for multimodal reasoning.

March 25, 2026

Original Paper

Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

Yunheng Li, Hangyi Kuang, Hengrui Zhang, Jiangxia Cao, Zhaojie Liu, Qibin Hou, Ming-Ming Cheng

arXiv · 2603.22847

The Takeaway

By identifying distinct 'perceptual grounding' and 'exploratory' token dynamics, the authors enable fine-grained RL that optimizes the reasoning trajectory itself rather than just the final output. This significantly improves performance on complex visual puzzles and geometry tasks where standard RL often fails due to sparse rewards.
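To make the idea concrete, here is a minimal sketch of what token-weighted policy optimization could look like. This is not the paper's actual algorithm: the function name, the weight values, and the assumption that a binary mask separating 'perceptual grounding' tokens from 'exploratory' tokens is already available are all illustrative placeholders (how to identify those tokens is precisely the paper's contribution).

```python
import numpy as np

def token_weighted_pg_loss(logprobs, advantages, grounding_mask,
                           w_grounding=1.5, w_exploratory=1.0):
    """REINFORCE-style loss with per-token weights (illustrative sketch).

    logprobs:       (T,) log pi(a_t | context) for each generated token
    advantages:     (T,) per-token advantage estimates
    grounding_mask: (T,) 1 for 'perceptual grounding' tokens,
                    0 for 'exploratory' tokens -- assumed given here
    """
    logprobs = np.asarray(logprobs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    mask = np.asarray(grounding_mask, dtype=float)

    # Fine-grained weighting: grounding and exploratory tokens get
    # different credit instead of one uniform sequence-level signal.
    weights = w_grounding * mask + w_exploratory * (1.0 - mask)

    # Negative weighted score-function objective, averaged over tokens.
    return -np.mean(weights * advantages * logprobs)
```

The contrast with coarse RLVR is that a uniform weight of 1.0 for every token would collapse this back to a sequence-level update; separating the two token populations lets the optimizer shape the reasoning trajectory itself.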

From the abstract

Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT tokens uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show t…