Using the best-performing models as anchors in 'LLM-as-a-judge' evaluations significantly reduces how well the resulting rankings correlate with human rankings.
arXiv · March 18, 2026 · 2603.16848
The Takeaway
The paper finds that 'mediocre' anchors provide much better signal for relative model rankings than SOTA models do. This finding directly impacts how every major LLM benchmark should be constructed and interpreted to ensure scientific validity.
From the abstract
The "LLM-as-a-judge" paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors …
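As a concrete illustration of the trade-off the abstract describes, here is a minimal Python sketch, not the paper's code, of the two evaluation protocols and one common way to score an anchor's reliability. The `judge_prefers` callable, the function names, and the use of Spearman's rho against human preference scores are all illustrative assumptions.

```python
from itertools import combinations

from scipy.stats import spearmanr


def round_robin_scores(models, prompts, judge_prefers):
    """Full pairwise tournament: O(n^2) judge calls per prompt.

    `judge_prefers(a, b, prompt)` is a hypothetical stand-in for a
    single LLM-judge call, returning True if model a's output wins.
    """
    wins = {m: 0 for m in models}
    for prompt in prompts:
        for a, b in combinations(models, 2):
            wins[a if judge_prefers(a, b, prompt) else b] += 1
    return wins


def anchored_win_rates(models, anchor, prompts, judge_prefers):
    """Single-anchor protocol (the Arena-Hard / AlpacaEval setup):
    each model is judged only against `anchor`, so the number of
    judge calls grows linearly in the number of models."""
    return {
        m: sum(judge_prefers(m, anchor, p) for p in prompts) / len(prompts)
        for m in models
        if m != anchor
    }


def anchor_reliability(judge_scores, human_scores):
    """Spearman rank correlation between the anchor-induced scores
    and human preference scores (higher = more reliable anchor)."""
    models = sorted(judge_scores)
    rho, _ = spearmanr(
        [judge_scores[m] for m in models],
        [human_scores[m] for m in models],
    )
    return rho
```

Under this framing, the paper's question amounts to which choice of `anchor` yields the highest such correlation, assessed across the 22 candidate anchors it evaluates.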