Discounted Beta-Bernoulli (DBB) reward estimation solves the variance collapse and sample inefficiency inherent in point-estimation RLVR methods for LLM reasoning.
March 20, 2026
Original Paper
Discounted Beta-Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards
arXiv · 2603.18444
The Takeaway
As LLM post-training shifts toward RL with verifiable rewards (RLVR), standard point estimation is failing. DBB achieves significant accuracy gains (+12 points out-of-distribution) without additional compute by leveraging historical reward statistics to stabilize training.
From the abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate …
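To make the contrast with point estimation concrete, here is a minimal sketch of a discounted Beta-Bernoulli estimator for a binary (verifiable) reward. This is an illustrative reconstruction from the abstract, not the paper's implementation: the class name, the discount factor `gamma`, and the prior parameters `alpha0`/`beta0` are assumptions. The key property is that the posterior mean shrinks toward the prior and the posterior variance stays strictly positive, whereas an empirical mean over a handful of identical rollouts collapses to 0 or 1 with zero variance.

```python
class DiscountedBetaBernoulli:
    """Tracks a Beta posterior over a Bernoulli reward probability,
    exponentially discounting old observations (hypothetical sketch)."""

    def __init__(self, alpha0=1.0, beta0=1.0, gamma=0.95):
        # alpha0/beta0: Beta prior pseudo-counts; gamma: discount on history.
        self.alpha = alpha0
        self.beta = beta0
        self.gamma = gamma

    def update(self, reward):
        # reward is 0 or 1 (a verifiable outcome for one rollout).
        # Discount the old pseudo-counts, then add the new observation.
        self.alpha = self.gamma * self.alpha + reward
        self.beta = self.gamma * self.beta + (1 - reward)

    def mean(self):
        # Posterior mean of the success probability.
        return self.alpha / (self.alpha + self.beta)

    def variance(self):
        # Posterior variance; strictly positive, so it never collapses.
        s = self.alpha + self.beta
        return (self.alpha * self.beta) / (s * s * (s + 1.0))


# Four successful rollouts in a row: a point estimate would be exactly 1.0
# with zero variance, but the discounted posterior keeps uncertainty alive.
est = DiscountedBetaBernoulli()
for r in [1, 1, 1, 1]:
    est.update(r)
```

Because the discount caps the effective sample size at roughly 1/(1 - gamma), the estimator also stays responsive to non-stationary reward statistics as the policy improves during training.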