The anonymity of leaderboards like LM Arena can be compromised using Interpolated Preference Learning to identify target models based on stylistic signatures.
arXiv · March 17, 2026 · 2603.15220
The Takeaway
This exposes a fundamental security vulnerability in human-voting-based leaderboards, showing that 'blind' tests can be gamed or de-anonymized. It forces the community to rethink how to maintain the integrity of competitive LLM evaluations.
From the abstract
Strict anonymity of model responses is key to the reliability of voting-based leaderboards such as LM Arena. While prior studies have attempted to compromise this assumption using simple statistical features like TF-IDF or bag-of-words, these methods often lack the discriminative power to distinguish between stylistically similar or within-family models. To overcome these limitations and expose the severity of this vulnerability, we introduce INTERPOL, a model-driven identification framework that …
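To make the baseline concrete: the "bag-of-words" approaches the abstract critiques amount to building a word-frequency profile per model from known responses, then attributing an anonymous response to the model with the most similar profile. The sketch below is a minimal illustration of that baseline idea, not the paper's INTERPOL method; the model names and toy corpora are invented for the example.

```python
from collections import Counter
import math

def bow(text):
    # Bag-of-words: lowercase whitespace-token counts
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def build_profile(responses):
    # Aggregate token counts over all known responses of one model
    profile = Counter()
    for r in responses:
        profile.update(bow(r))
    return profile

def identify(response, profiles):
    # Attribute an anonymous response to the most similar profile
    return max(profiles, key=lambda m: cosine(bow(response), profiles[m]))

# Hypothetical toy corpora standing in for collected arena responses
corpora = {
    "model_a": ["certainly here is a detailed answer",
                "certainly let us break this down"],
    "model_b": ["sure thing here you go",
                "sure thing the short answer is"],
}
profiles = {m: build_profile(rs) for m, rs in corpora.items()}
print(identify("certainly here is the breakdown", profiles))  # -> model_a
```

As the abstract notes, such surface-level lexical signatures blur between stylistically similar or same-family models, which is the gap a model-driven approach like INTERPOL targets.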