AI & ML New Capability

Introduces a way to train reward models that generate "transferable rubrics": explicit scoring criteria that improve performance across different tasks and models.

arXiv · March 18, 2026 · 2603.16600

Weijie Qiu, Dai Guan, Junxin Wang, Zhihang Li, Yongbo Gai, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

The Takeaway

Reward models are usually black boxes; this framework forces the model to generate a logic-based rubric that a proxy agent can use to predict the same preference. This makes the reward process interpretable and improves data efficiency by 4x compared to standard training.
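The proxy-verification idea above can be sketched in a few lines: the reward model emits an explicit rubric, and a separate proxy applies that rubric to both candidate answers; the rubric "passes" only if the proxy reproduces the intended preference. This is a toy illustration under assumed names (`grm_generate_rubric`, `proxy_score`, `verification_reward` and the keyword checks are all hypothetical stand-ins, not the paper's implementation).

```python
# Toy sketch of proxy-guided rubric verification.
# All function names and checks are illustrative assumptions.

def grm_generate_rubric(prompt):
    # Stand-in for the reward model emitting explicit, weighted criteria.
    return [("mentions units", 0.4), ("gives a final number", 0.6)]

def proxy_score(rubric, answer):
    # A proxy agent applies each criterion literally, using only the rubric.
    checks = {
        "mentions units": lambda a: "km" in a,
        "gives a final number": lambda a: any(c.isdigit() for c in a),
    }
    return sum(weight for name, weight in rubric if checks[name](answer))

def verification_reward(prompt, chosen, rejected):
    # The rubric is "verified" when the proxy, given only the rubric,
    # reproduces the preferred ordering of the two answers.
    rubric = grm_generate_rubric(prompt)
    if proxy_score(rubric, chosen) > proxy_score(rubric, rejected):
        return 1.0
    return 0.0

reward = verification_reward(
    "How far is the Moon?",
    chosen="About 384,400 km on average.",
    rejected="Pretty far away.",
)
print(reward)  # 1.0
```

Because the proxy sees only the rubric, a vague or unfaithful rubric fails verification even when the reward model's own verdict was right, which is what makes the intermediate rubric directly trainable.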

From the abstract

Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Rein…
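The three-stage pipeline the abstract names (rubric generation, criterion-based scoring, final verdict) can be mocked up as follows. This is a minimal sketch under stated assumptions: `generate_rubric`, `score_criterion`, and the keyword heuristics are hypothetical placeholders for what would be VLM calls in a real system.

```python
# Hedged sketch of a three-stage generative-reward-model pipeline:
# 1) generate a rubric, 2) score each criterion, 3) emit a verdict.
# All helpers below are illustrative stand-ins, not the paper's code.

def generate_rubric(question):
    # Stage 1: the GRM would generate question-specific criteria.
    return ["is factually grounded", "answers the question directly"]

def score_criterion(criterion, answer):
    # Stage 2: per-criterion scoring; a real system would prompt a VLM,
    # here we use crude keyword checks as placeholders.
    keyword = {
        "is factually grounded": "because",
        "answers the question directly": "yes",
    }[criterion]
    return 1 if keyword in answer.lower() else 0

def verdict(question, answer_a, answer_b):
    # Stage 3: aggregate criterion scores into a final preference.
    rubric = generate_rubric(question)
    score_a = sum(score_criterion(c, answer_a) for c in rubric)
    score_b = sum(score_criterion(c, answer_b) for c in rubric)
    return "A" if score_a >= score_b else "B"

print(verdict("Is the sky blue?",
              "Yes, because of Rayleigh scattering.",
              "Maybe."))  # A
```

The abstract's point is that stage 1 is usually left unoptimized; Proxy-GRM's contribution is a training signal that targets the rubric itself rather than only the final verdict.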