Reveals that standard global correlation metrics for LLM judges fail to predict success in 'best-of-n' selection tasks due to within-prompt signal loss.
arXiv · March 16, 2026 · 2603.12520
Why it matters
Practitioners often adopt LLM-as-a-judge on the strength of high benchmark correlation, but this paper shows that global scores are dominated by baseline effects rather than the within-prompt ranking ability needed for inference-time selection. It provides a new auditing framework (sNDCG) to fix this 'evaluation blindness'.
From the abstract
Large language models are often used as judges to score candidate responses, then validated with a single global metric such as correlation with reference labels. This can be misleading when the real deployment task is best-of-n selection within a prompt. In a 5,000-prompt best-of-4 benchmark from Chatbot Arena, a judge with moderate global correlation (r = 0.47) captures only 21.0% of the improvement that perfect selection would achieve over random choice. The gap arises because global agreement …
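The "fraction of improvement captured" quantity from the abstract can be sketched directly: compare judge-based best-of-n selection against oracle (perfect) selection and a random-choice baseline. This is a minimal illustration on synthetic data, not the paper's benchmark; the function name and the noise model for the judge are assumptions.

```python
import numpy as np

def captured_improvement(true_scores, judge_scores):
    """Fraction of the oracle's best-of-n gain over random choice
    that selecting by the judge's scores actually captures.

    Both inputs are (n_prompts, n_candidates) arrays, where
    true_scores holds reference quality and judge_scores the
    judge's ratings for the same candidates.
    """
    oracle = true_scores.max(axis=1).mean()    # perfect per-prompt selection
    random = true_scores.mean(axis=1).mean()   # expected value of a random pick
    picks = judge_scores.argmax(axis=1)        # judge's best-of-n choice per prompt
    judged = true_scores[np.arange(len(picks)), picks].mean()
    return (judged - random) / (oracle - random)

# Synthetic best-of-4 setup (hypothetical): the judge sees true
# quality corrupted by Gaussian noise, giving moderate correlation.
rng = np.random.default_rng(0)
true = rng.normal(size=(5000, 4))
judge = true + rng.normal(scale=1.5, size=true.shape)

frac = captured_improvement(true, judge)
```

A noisy judge with respectable global correlation can still land well below 1.0 here, which is the within-prompt signal loss the paper highlights: global agreement does not guarantee correct per-prompt rankings.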