Introduces a way to train reward models that generate "transferable rubrics": explicit scoring criteria that improve performance across different tasks and models.
arXiv · March 18, 2026 · 2603.16600
The Takeaway
Reward models are usually black boxes; this framework instead trains the model to generate a logic-based rubric that a proxy agent can use to reproduce the same preference. This makes the reward process interpretable and improves data efficiency by 4x compared to standard training.
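The agreement check described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it assumes a rubric is a list of weighted criteria, and that both the GRM and a smaller proxy score each criterion, so the rubric earns reward only when the proxy reaches the same verdict.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Criterion:
    """One rubric entry: a scoring criterion and its weight (assumed representation)."""
    description: str
    weight: float

def aggregate(rubric: List[Criterion], scores: List[float]) -> float:
    # Weighted sum of per-criterion scores for one candidate response.
    return sum(c.weight * s for c, s in zip(rubric, scores))

def preference(rubric: List[Criterion],
               scores_a: List[float],
               scores_b: List[float]) -> str:
    # Verdict under this rubric: which response scores higher.
    return "A" if aggregate(rubric, scores_a) >= aggregate(rubric, scores_b) else "B"

def verification_reward(rubric: List[Criterion],
                        grm_a: List[float], grm_b: List[float],
                        proxy_a: List[float], proxy_b: List[float]) -> float:
    # Reward the generated rubric only if a proxy scorer, using the same
    # rubric, reproduces the GRM's preference verdict.
    same = preference(rubric, grm_a, grm_b) == preference(rubric, proxy_a, proxy_b)
    return 1.0 if same else 0.0

rubric = [Criterion("factual accuracy", 0.7), Criterion("clarity", 0.3)]
# GRM and proxy both prefer response A here, so the rubric is rewarded.
print(verification_reward(rubric, [0.9, 0.2], [0.4, 0.9], [0.8, 0.1], [0.3, 0.8]))
```

A verdict-agreement reward like this is one simple way to make the intermediate rubric directly optimizable; the actual training signal in the paper may differ.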
From the abstract
Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Rein