Proposes REAL, a Reinforcement Learning framework tailored for regression and ordinal scoring rather than simple binary accuracy.
arXiv · March 19, 2026 · 2603.17145
The Takeaway
Standard RL rewards treat 'almost right' and 'completely wrong' identically. REAL lets LLM-as-a-Judge systems optimize for continuous/ordinal feedback, yielding significantly better generalization and stronger correlation with human labels in automated evaluation.
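The contrast can be sketched in a few lines. This is a minimal illustration of the idea the paper motivates, not REAL's actual reward function; the function names and the linear distance penalty are assumptions for illustration only.

```python
def binary_reward(pred: int, truth: int) -> float:
    """Standard 0-1 accuracy reward: every miss scores the same."""
    return 1.0 if pred == truth else 0.0

def ordinal_reward(pred: int, truth: int, lo: int = 1, hi: int = 5) -> float:
    """Hypothetical distance-aware reward on a [lo, hi] scale:
    predictions closer to the ground truth earn more."""
    return 1.0 - abs(pred - truth) / (hi - lo)

# With ground truth 5: binary reward scores predictions 4 and 1 equally
# (both 0.0), while the ordinal reward prefers 4 (0.75) over 1 (0.0).
for pred in (5, 4, 1):
    print(pred, binary_reward(pred, 5), ordinal_reward(pred, 5))
```

Under the binary reward, the policy gets no signal to move from 1 toward 4; the distance-aware reward provides exactly that gradient.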
From the abstract
Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the ground truth is 5. Conversely, existing regression-aware …