Proposes Causal Process Reward (CPR) to fix 'cherry-picking' in MLLM reasoning by coupling answer correctness with step-level logical alignment.
arXiv · March 16, 2026 · 2603.13099
Why it matters
The CPR-Curriculum approach improves reasoning fidelity by 32% via GRPO without requiring manual step-by-step annotation, addressing a major bottleneck in scaling reliable reasoning models.
From the abstract
We introduce **CRYSTAL** (**C**lear **R**easoning via **Y**ielded **S**teps, **T**raceability and **L**ogic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: *Match F1*, which scores step-level precision and recall via semantic similarity matching, and *Ordered Match F1*, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline w
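To make the two metrics concrete, here is a minimal sketch of what a Match F1 / Ordered Match F1 computation could look like. This is an illustrative toy, not the paper's implementation: it stands in for semantic similarity with `difflib.SequenceMatcher`, pairs steps greedily above a threshold, and implements the ordered variant as a longest-increasing-subsequence filter over matched reference indices; the actual similarity model, matching procedure, and threshold in CRYSTAL may differ.

```python
from difflib import SequenceMatcher


def _sim(a: str, b: str) -> float:
    # Stand-in for semantic similarity (the paper likely uses an
    # embedding model); character-level ratio via difflib.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def _ordered_subset(matches):
    # Keep only matches whose reference indices form a longest
    # increasing subsequence, penalizing disordered chains (O(n^2) DP).
    n = len(matches)
    if n == 0:
        return []
    length, parent = [1] * n, [-1] * n
    for i in range(n):
        for j in range(i):
            if matches[j][1] < matches[i][1] and length[j] + 1 > length[i]:
                length[i], parent[i] = length[j] + 1, j
    i = max(range(n), key=lambda k: length[k])
    seq = []
    while i != -1:
        seq.append(matches[i])
        i = parent[i]
    return seq[::-1]


def match_f1(pred_steps, ref_steps, threshold=0.7, ordered=False):
    """Toy Match F1: greedily pair each predicted step with the most
    similar unused reference step whose similarity clears `threshold`."""
    used, matches = set(), []  # matches: (pred_idx, ref_idx) pairs
    for i, p in enumerate(pred_steps):
        best, best_j = 0.0, None
        for j, r in enumerate(ref_steps):
            if j not in used and _sim(p, r) > best:
                best, best_j = _sim(p, r), j
        if best_j is not None and best >= threshold:
            used.add(best_j)
            matches.append((i, best_j))
    if ordered:
        matches = _ordered_subset(matches)
    m = len(matches)
    prec = m / len(pred_steps) if pred_steps else 0.0
    rec = m / len(ref_steps) if ref_steps else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

A reasoning chain that states the right steps in the wrong order scores the same Match F1 but a lower Ordered Match F1, which is the gap the second metric is designed to expose.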