Shows that many large multi-task LLM benchmark suites provide highly redundant information, with major leaderboards often spanning only ~2 independent axes of measurement.
April 1, 2026
Original Paper
BenchScope: How Many Independent Signals Does Your Benchmark Provide?
arXiv · 2603.29357
The Takeaway
This challenges the practice of 'chasing numbers' on multi-task leaderboards. By providing a metric for 'Effective Dimensionality', it allows practitioners to identify which benchmarks actually test new capabilities and which are simply measuring the same latent factors.
From the abstract
AI evaluation suites often report many scores without checking whether those scores carry independent information. We introduce Effective Dimensionality (ED), the participation ratio of a centered benchmark-score spectrum, as a fast, population-conditional upper-bound diagnostic of measurement breadth. Applied at per-instance granularity to 22 benchmarks across 8 domains and more than 8,400 model evaluations, ED reveals substantial redundancy: the six-score Open LLM Leaderboard behaves like roughly …
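The core quantity is simple to compute. A minimal sketch of the participation-ratio idea, assuming ED is taken over the eigenvalue spectrum of the covariance of centered benchmark scores (the paper's exact per-instance procedure may differ, and the synthetic data below is purely illustrative):

```python
import numpy as np

def effective_dimensionality(scores):
    """Participation ratio of the eigenvalue spectrum of the
    covariance of centered scores: (sum(lam))^2 / sum(lam^2).
    Ranges from 1 (all variance on one axis) to the number of
    score columns (perfectly independent axes)."""
    X = scores - scores.mean(axis=0)       # center each benchmark column
    cov = np.cov(X, rowvar=False)          # benchmarks x benchmarks covariance
    lam = np.linalg.eigvalsh(cov)
    lam = np.clip(lam, 0.0, None)          # guard tiny negative eigenvalues
    return lam.sum() ** 2 / (lam ** 2).sum()

# Hypothetical example: 6 benchmark scores driven by ~2 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))         # 200 models, 2 latent abilities
loadings = rng.normal(size=(2, 6))         # how each benchmark loads on them
scores = latent @ loadings + 0.1 * rng.normal(size=(200, 6))
print(effective_dimensionality(scores))    # well below the naive count of 6
```

A suite whose ED sits far below its number of reported scores is, on this population of models, mostly re-measuring the same latent factors.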