AI & ML Paradigm Shift

Introduces a multi-answer RL objective that trains models to represent a distribution of valid answers in a single forward pass.

March 27, 2026

Original Paper

Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, Yoon Kim

arXiv · 2603.24844

The Takeaway

Most post-training methods (e.g., RLHF) collapse the model's answer distribution toward a single mode; the proposed objective explicitly preserves diversity. This matters in high-stakes domains such as medicine or coding, where several plausible hypotheses are more valuable than one 'best' guess.
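To make the contrast concrete, here is a minimal sketch (not the paper's actual objective) of how a multi-answer reward could credit coverage of a reference answer set, whereas a standard exact-match reward credits only one answer. The function names and the medical example are illustrative assumptions.

```python
def exact_match_reward(sample: str, gold: str) -> float:
    """Standard single-answer reward: 1 only for the one 'best' answer.
    Optimizing this pushes all probability mass onto a single mode."""
    return 1.0 if sample == gold else 0.0

def coverage_reward(samples: list[str], valid: set[str]) -> float:
    """Hypothetical multi-answer reward: the fraction of the valid-answer
    set covered by the model's samples. Rewarding coverage gives the
    policy an incentive to keep probability mass on every plausible
    answer instead of collapsing onto one."""
    covered = {s for s in samples if s in valid}
    return len(covered) / len(valid)

valid = {"pneumonia", "bronchitis", "pulmonary embolism"}

# A collapsed policy repeats its single mode and covers one valid answer...
collapsed = ["pneumonia"] * 4
# ...while a distribution-preserving policy spreads samples over hypotheses.
diverse = ["pneumonia", "bronchitis", "pneumonia", "pulmonary embolism"]

print(coverage_reward(collapsed, valid))  # 1/3
print(coverage_reward(diverse, valid))    # 1.0
```

Under the exact-match reward both policies can score identically, but the coverage reward separates them, which is the intuition behind training for a distribution of valid answers rather than a point estimate.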

From the abstract

Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information.