AI & ML Scaling Insight

Demonstrates that LLM judge panels follow power-law discovery curves, making panel size and persona diversity critical for uncovering edge-case failures.

April 2, 2026

Original Paper

Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation

HyunJoon Jung, William Na

arXiv · 2604.00477

The Takeaway

The paper provides a quantitative framework for evaluating conversational AI, showing that score accuracy saturates much faster than unique issue discovery. This insight lets practitioners right-size judge panels and shows that structured persona-conditioning is essential for robust evaluation.

From the abstract

LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed? Through 960 sessions with two model pairs across 15 tasks, we show that persona-based agent judges produce evaluations indistinguishable from human raters in a Turing-style validation. We then identify a score-coverage dissociation: quality scores improve logarithmically with panel size, while unique issue discovery follows a power law.
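The power-law discovery curve can be illustrated with a toy simulation. Here, each judge is assumed to flag issues drawn from a heavy-tailed (Zipf-like) pool, so common issues recur across judges while rare edge cases require many judges to surface. The pool size, Zipf exponent, and flags-per-judge are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical issue pool: probabilities decay as a Zipf law, so a few
# issues are common and a long tail of edge cases is rare.
n_issue_types = 500
issue_probs = 1.0 / np.arange(1, n_issue_types + 1) ** 1.2
issue_probs /= issue_probs.sum()

def unique_issues(panel_size, flags_per_judge=10):
    """Count distinct issue types surfaced by a panel of the given size."""
    found = set()
    for _ in range(panel_size):
        found.update(rng.choice(n_issue_types, size=flags_per_judge, p=issue_probs))
    return len(found)

panel_sizes = [1, 2, 4, 8, 16, 32]
discoveries = [unique_issues(n) for n in panel_sizes]

# On log-log axes a power law is a straight line; fit its slope.
slope, _ = np.polyfit(np.log(panel_sizes), np.log(discoveries), 1)
print(f"discovery exponent ≈ {slope:.2f}")  # sub-linear: between 0 and 1
```

Doubling the panel keeps yielding new issues (the exponent stays positive), but with diminishing returns, which is why coverage, unlike score accuracy, does not saturate at small panel sizes.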