Proposes REAL, a Reinforcement Learning framework tailored for regression and ordinal scoring rather than simple binary accuracy.
arXiv · March 19, 2026 · 2603.17145
The Takeaway
Standard RL rewards treat 'almost right' and 'completely wrong' identically. REAL lets LLM-as-a-Judge systems optimize for continuous/ordinal feedback, yielding significantly better generalization and stronger correlation with human labels in automated evaluation.
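The contrast can be sketched in a few lines. This is a minimal illustration of the idea the paper motivates, not REAL's actual reward function; the function names and the linear distance penalty are assumptions for illustration only.

```python
def binary_reward(pred: int, truth: int) -> float:
    """Standard 0-1 accuracy reward: every miss scores the same."""
    return 1.0 if pred == truth else 0.0

def ordinal_reward(pred: int, truth: int, lo: int = 1, hi: int = 5) -> float:
    """Hypothetical distance-aware reward on a [lo, hi] scale:
    predictions closer to the ground truth earn more."""
    return 1.0 - abs(pred - truth) / (hi - lo)

# With ground truth 5: binary reward scores predictions 4 and 1 equally
# (both 0.0), while the ordinal reward prefers 4 (0.75) over 1 (0.0).
for pred in (5, 4, 1):
    print(pred, binary_reward(pred, 5), ordinal_reward(pred, 5))
```

Under the binary reward, the policy gets no signal to move from 1 toward 4; the distance-aware reward provides exactly that gradient.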
From the abstract
Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the ground truth is 5. Conversely, existing regression-aware …