AI & ML Scaling Insight

This paper shows that increasing test-time compute via wider beam search can hurt LLM reasoning performance, because selecting among noisily scored candidates introduces an overestimation bias.

arXiv · March 17, 2026 · 2603.15377

Gal Dalal, Assaf Hallak, Gal Chechik, Yftach Ziser

The Takeaway

The paper establishes a theoretical 'maximum useful beam width' based on the signal-to-noise ratio of the process reward model (PRM). This challenges the current trend of blindly scaling inference compute and provides a formula to predict when wider search becomes counterproductive.
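The Extreme Value Theory intuition behind a width cap can be illustrated with a standard result: the expected maximum of k independent Gaussian noise draws grows on the order of sigma * sqrt(2 ln k). The sketch below is illustrative only (the parameters and setup are assumptions, not the paper's model); it shows empirically how selection noise inflates with candidate pool size, which is what eventually swamps the signal.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # assumed scorer noise scale (illustrative, not from the paper)

for k in [2, 8, 32, 128, 512]:
    # Empirical mean of the maximum of k iid N(0, sigma^2) noise draws.
    maxima = rng.normal(0.0, sigma, size=(20_000, k)).max(axis=1)
    # Leading-order EVT growth rate; the empirical mean sits somewhat below it.
    evt_rate = sigma * np.sqrt(2.0 * np.log(k))
    print(f"k={k:4d}  empirical max noise={maxima.mean():.3f}  "
          f"sigma*sqrt(2 ln k)={evt_rate:.3f}")
```

The slow sqrt-log growth is the key: each doubling of the beam adds less signal from the true best candidate but keeps inflating the maximum of the noise, so at some width the two curves cross.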

From the abstract

Wider beam search should improve LLM reasoning, but when should you stop widening? Prior work on beam width selection has focused on inference efficiency [qin2025dsbd, freitag2017beam], without analyzing whether wider search can *hurt* output quality. We present an analysis, grounded in Extreme Value Theory, that answers this question. Beam selection over noisy scorer outputs introduces a systematic overestimation bias that grows with the candidate pool size, and we derive a maximum useful beam width.
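The growing overestimation bias can be reproduced in a toy simulation. The setup below is a hypothetical stand-in for the paper's model: each candidate has a true quality, the scorer returns quality plus Gaussian noise, and selection takes the argmax of the noisy scores. The gap between the selected candidate's score and its true quality (the overestimation bias) widens with pool size, while the true quality of the selection saturates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model (not the paper's exact one): true quality ~ N(0, 1),
# scorer noise ~ N(0, sigma^2) with sigma large, i.e. a low-SNR scorer.
n_trials = 5_000
sigma = 2.0

for width in [1, 4, 16, 64, 256]:
    true_q = rng.normal(0.0, 1.0, size=(n_trials, width))
    scores = true_q + rng.normal(0.0, sigma, size=true_q.shape)
    pick = scores.argmax(axis=1)            # select on the noisy score
    rows = np.arange(n_trials)
    bias = (scores[rows, pick] - true_q[rows, pick]).mean()
    print(f"width={width:3d}  selected true quality={true_q[rows, pick].mean():.3f}  "
          f"overestimation bias={bias:.3f}")
```

In this simple Gaussian toy the selected true quality merely plateaus; the paper's EVT analysis is about when the bias outpaces the gain, making further widening counterproductive.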