PRCO decouples perception and reasoning in Multimodal RL through an Observer-Solver architecture.
March 31, 2026
Original Paper
Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
arXiv · 2603.28618
The Takeaway
PRCO rewards the Observer based on the Solver's success, which resolves the credit-assignment problem in MLLM reasoning and yields a 7-point average accuracy gain across multimodal benchmarks.
From the abstract
Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction.
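The decoupling described above can be illustrated with a minimal sketch: the Solver gets a verifiable outcome reward, while the Observer is scored by how often the Solver succeeds when conditioned on the Observer's description. All names here are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of decoupled credit assignment in an
# Observer-Solver setup. Function names and the exact reward
# shaping are assumptions for illustration only.

def solver_reward(answer: str, gold: str) -> float:
    """Verifiable outcome reward for the Solver:
    1.0 if the final answer matches the gold answer, else 0.0."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def observer_reward(solver_successes: list[float]) -> float:
    """Reward the Observer by the Solver's downstream success rate:
    the mean outcome reward over Solver rollouts that were all
    conditioned on this Observer's visual description."""
    return sum(solver_successes) / len(solver_successes)

# Example: one Observer description, four sampled Solver rollouts,
# three of which reach the correct final answer.
rollouts = [solver_reward(a, "42") for a in ["42", "42", "17", "42"]]
print(observer_reward(rollouts))  # 0.75
```

Because the Observer's reward depends only on whether its extracted visual evidence lets the Solver succeed, perception quality is credited separately from reasoning quality, rather than both sharing one answer-level signal.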