Introduces a multi-answer RL objective that trains models to represent a distribution of valid answers in a single forward pass.
March 27, 2026
Original Paper
Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models
arXiv · 2603.24844
The Takeaway
Most post-training methods (like RLHF) cause distribution collapse toward a single mode; this approach explicitly preserves diversity. That matters in high-stakes domains like medicine or coding, where a set of plausible hypotheses is more valuable than one "best" guess.
From the abstract
Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information.
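To make the collapse problem concrete, here is a minimal toy sketch of a coverage-style multi-answer reward (an illustrative assumption, not the paper's actual objective): instead of scoring whether any single sample matches one reference answer, it scores how much of the full set of valid answers a batch of samples covers, so a collapsed model that repeats one mode is penalized relative to a diverse one.

```python
def coverage_reward(samples, valid_answers):
    """Toy multi-answer reward: fraction of valid answers covered by the samples.

    A single-answer reward would return 1 as soon as any sample matches the one
    reference; this variant instead rewards representing the whole answer set.
    (Illustrative sketch only; not the objective from the paper.)
    """
    covered = set(samples) & set(valid_answers)
    return len(covered) / len(valid_answers)

# Hypothetical diagnosis task with three equally valid answers.
valid = {"strep throat", "mono", "tonsillitis"}

# A collapsed model repeats its dominant mode; a diverse model spreads
# probability mass across the valid set. Both emit only valid answers,
# yet only the diverse one covers the distribution.
collapsed = ["strep throat", "strep throat", "strep throat", "strep throat"]
diverse = ["strep throat", "mono", "tonsillitis", "mono"]

print(coverage_reward(collapsed, valid))  # 1/3: one mode covered
print(coverage_reward(diverse, valid))    # 1.0: full answer set covered
```

Under a reward like this, the gradient signal pushes the policy to keep probability mass on every valid answer rather than concentrating it on the argmax, which is the intuition behind training for distributional rather than mode-seeking behavior.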