Argues that 'inverse scaling' on many benchmarks is a prompt-dependent artifact of verbosity, and that it can be reversed by forcing brevity.
April 2, 2026
Original Paper
Brevity Constraints Reverse Performance Hierarchies in Language Models
arXiv · 2604.00025
The Takeaway
This challenges the conventional wisdom that larger models sometimes get 'dumber' on specific tasks. Instead, it argues that larger models possess superior latent capabilities that are masked by a tendency to over-elaborate, suggesting a shift in how we evaluate model intelligence.
From the abstract
Standard evaluation protocols reveal a counterintuitive phenomenon: on 7.7% of benchmark problems spanning five datasets, larger language models underperform smaller ones by 28.4 percentage points despite 10-100x more parameters. Through systematic evaluation of 31 models (0.5B-405B parameters) across 1,485 problems, we identify the mechanism as spontaneous scale-dependent verbosity that introduces errors through overelaboration. Causal intervention experiments demonstrate this reflects correct […]
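The intervention described in the abstract, constraining response length to suppress over-elaboration, can be sketched as a simple prompt wrapper. This is an illustrative sketch only: the function name, instruction wording, and word limit are assumptions, not the paper's actual protocol.

```python
def with_brevity_constraint(question: str, max_words: int = 30) -> str:
    """Wrap a benchmark question in a brevity instruction.

    Hypothetical helper: the idea, following the paper's framing, is that
    forcing short answers suppresses the scale-dependent verbosity that
    introduces errors in larger models. Exact wording is an assumption.
    """
    return (
        f"Answer the following in at most {max_words} words. "
        f"State the answer directly; do not elaborate.\n\n"
        f"{question}"
    )

# The same benchmark item under the default (unconstrained) and
# brevity-constrained conditions:
question = "Is the square root of 2 rational?"
baseline_prompt = question
constrained_prompt = with_brevity_constraint(question, max_words=15)
print(constrained_prompt)
```

In an evaluation harness, one would send both prompts to each model and compare accuracy across the two conditions, which is the kind of causal comparison the abstract alludes to.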