AI & ML Paradigm Shift

PRCO decouples perception and reasoning in Multimodal RL through an Observer-Solver architecture.

March 31, 2026

Original Paper

Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

Ziqi Miao, Haonan Jia, Lijun Li, Chen Qian, Yuan Xiong, Wenting Yan, Jing Shao

arXiv · 2603.28618

The Takeaway

PRCO rewards the 'Observer' based on the 'Solver's' success, resolving the credit assignment problem in MLLM reasoning and yielding a 7-point average accuracy boost across multimodal benchmarks.

From the abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction.
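The contrast between a shared outcome reward and a decoupled Observer reward can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's implementation; the function names and the averaging-over-rollouts scheme are assumptions for clarity.

```python
# Hypothetical sketch of reward assignment in outcome-driven RLVR vs. a
# decoupled Observer-Solver setup. Names and details are illustrative.

def shared_outcome_reward(answer: str, gold: str) -> tuple[float, float]:
    """Baseline RLVR: perception and reasoning share one final-answer reward,
    so credit between the two stages is blurred."""
    r = 1.0 if answer == gold else 0.0
    return r, r  # (perception_reward, reasoning_reward) are identical


def observer_reward(solver_successes: list[bool]) -> float:
    """Decoupled idea: score the Observer's visual description by how often
    a Solver, conditioned on that description, reaches the correct answer."""
    return sum(solver_successes) / len(solver_successes)
```

Under the shared scheme a poor description paired with lucky reasoning still earns full credit; the decoupled reward ties the Observer's signal directly to how useful its description was across Solver rollouts.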