AI & ML Breaks Assumption

A rigorous multi-method audit revealing that frontier LLM performance on MMLU is significantly inflated by data contamination and memorization.

arXiv · March 18, 2026 · 2603.16197

Eshwar Reddy M, Sourav Karmakar

The Takeaway

By identifying distributed memorization signatures in models like DeepSeek-R1 and GPT-4o, the authors show that accuracy drops by up to 20% on law and ethics questions when the questions are slightly reworded. This challenges the validity of current leaderboards and suggests models are often retrieving answers rather than reasoning from first principles.
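The perturbation idea above can be sketched in a few lines: score a model on the original benchmark questions and again on lightly reworded versions, and treat a large accuracy gap as a memorization signal. This is a hypothetical illustration, not the paper's actual audit code; `accuracy`, `contamination_gap`, and the toy `memorizer` model are names invented here.

```python
# Hypothetical sketch of a perturbation-based contamination probe.
# `model` stands in for any LLM query function mapping question -> answer.

def accuracy(answers, gold):
    """Fraction of answers matching the gold labels."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

def contamination_gap(model, original, perturbed, gold):
    """Accuracy drop when benchmark questions are lightly reworded.

    A large positive gap suggests the model memorized the original
    phrasing rather than reasoning from first principles.
    """
    acc_orig = accuracy([model(q) for q in original], gold)
    acc_pert = accuracy([model(q) for q in perturbed], gold)
    return acc_orig - acc_pert

# Toy example: a "model" that only recognizes the exact original wording.
memorizer = {"Q1-orig": "A", "Q2-orig": "C"}.get
gap = contamination_gap(
    memorizer,
    original=["Q1-orig", "Q2-orig"],
    perturbed=["Q1-reworded", "Q2-reworded"],
    gold=["A", "C"],
)
print(gap)  # 1.0: perfect on memorized wording, zero on paraphrases
```

A robust model should show a gap near zero; the paper's finding is that frontier models show gaps as large as 20% in some MMLU categories.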

From the abstract

Public leaderboards increasingly suggest that large language models (LLMs) surpass human experts on benchmarks spanning academic knowledge, law, and programming. Yet most benchmarks are fully public, with their questions widely mirrored across the internet, creating a systematic risk that models were trained on the very data used to evaluate them. This paper presents three complementary experiments forming a rigorous multi-method contamination audit of six frontier LLMs: GPT-4o, GPT-4o-mini, DeepSeek-R