For 57 percent of users, individual preferences show zero or negative correlation with the global leaderboards used to rank AI models.
April 24, 2026
Original Paper
Personalized Benchmarking: Evaluating LLMs by Individual Preferences
arXiv · 2604.18943
The Takeaway
Global benchmarks assume that a single best model exists for every task. In practice, individual preferences vary so much that the top-ranked model on a leaderboard is often a poor choice for a specific person. This study shows that aggregate scores hide substantial diversity in how people actually interact with language models: what one person finds helpful, another may find repetitive or frustrating. Model selection should therefore be based on personal usage patterns rather than a static global average.
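The headline correlation statistic can be illustrated with a small sketch: given a user's per-model ratings and the global leaderboard's scores, compute the Spearman rank correlation between the two rankings. All model names and scores below are hypothetical, and this is an illustration of the general statistic, not the paper's exact methodology.

```python
# Hypothetical sketch: per-user Spearman rank correlation between a user's
# model ratings and a global leaderboard. All data below is illustrative.

def ranks(values):
    """Rank values in descending order (1 = best); ties receive the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Group consecutive equal values so ties share an averaged rank.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical global leaderboard vs. one user's own ratings.
global_scores = {"model_a": 0.92, "model_b": 0.88, "model_c": 0.81}
user_scores   = {"model_a": 0.40, "model_b": 0.75, "model_c": 0.85}

models = sorted(global_scores)
rho = spearman([global_scores[m] for m in models],
               [user_scores[m] for m in models])
print(f"user-vs-leaderboard Spearman rho = {rho:.2f}")  # -1.00 for this user
```

A user with rho at or below zero, like the one above, would fall into the "zero or negative correlation" group the headline describes: the leaderboard's ordering is uninformative or actively misleading for that person.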
From the abstract
With the rise in capabilities of large language models (LLMs) and their deployment in real-world tasks, evaluating LLM alignment with human preferences has become an important challenge. Current benchmarks average preferences across all users to compute aggregate ratings, overlooking individual user preferences when establishing model rankings. Since users have varying preferences in different contexts, we call for personalized LLM benchmarks that rank models according to individual needs.