Decouples perceptual failures from logical errors in Vision-Language reward models to enable more reliable test-time scaling.
March 18, 2026
Original Paper
Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models
arXiv · 2603.16253
The Takeaway
Standard VL-PRMs often penalize correct reasoning because they misinterpret the image (perceptual uncertainty). This framework uses explicit visual premise verification to gate rewards, significantly improving the reliability of Best-of-N reranking for complex multimodal reasoning tasks.
From the abstract
Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling. However, they often function as black-box judges: a low step score may reflect a genuine reasoning mistake or simply the verifier's misperception of the image. This entanglement between perception and reasoning leads to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements).
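To make the gating idea concrete, here is a minimal sketch of premise-gated Best-of-N reranking. It is an illustration of the general mechanism described in the takeaway, not the paper's actual implementation: the `Step` fields, the threshold `tau`, the zeroing of unverified steps, and the min-aggregation over step rewards are all simplifying assumptions.

```python
from dataclasses import dataclass

@dataclass
class Step:
    text: str
    reasoning_score: float   # PRM's logical-quality score in [0, 1]
    premise_verified: float  # assumed: verifier's confidence that the step's
                             # visual premise is actually grounded in the image

def gated_step_reward(step: Step, tau: float = 0.5) -> float:
    """Gate the reasoning reward on visual premise verification:
    a step whose premise fails verification earns no reward,
    regardless of how plausible its logic looks."""
    return step.reasoning_score if step.premise_verified >= tau else 0.0

def candidate_score(steps: list[Step], tau: float = 0.5) -> float:
    """Aggregate step rewards with min (a common PRM convention:
    a chain is only as strong as its weakest step)."""
    return min(gated_step_reward(s, tau) for s in steps) if steps else 0.0

def best_of_n(candidates: list[list[Step]], tau: float = 0.5) -> int:
    """Return the index of the highest-scoring candidate chain."""
    return max(range(len(candidates)),
               key=lambda i: candidate_score(candidates[i], tau))

# A chain with a hallucinated premise but fluent logic...
hallucinated = [Step("the chart shows 5 bars", 0.9, premise_verified=0.2),
                Step("so the total is 25", 0.95, premise_verified=0.9)]
# ...versus a grounded chain with slightly lower reasoning scores.
grounded = [Step("the chart shows 4 bars", 0.8, premise_verified=0.9),
            Step("so the total is 20", 0.85, premise_verified=0.9)]
```

Without the gate, min-aggregation over raw `reasoning_score` would prefer the hallucinated chain (0.9 vs. 0.8); with the gate, its unverified first step collapses to 0 and the grounded chain wins.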