AI & ML Paradigm Shift

Derives an exact, unbiased policy gradient for Reinforcement Learning on Diffusion LLMs, bypassing the need for sequence-level likelihood approximations.

arXiv · March 16, 2026 · 2603.12554

Vishnu Teja Kunde, Fatemeh Doudi, Mahdi Farahbakhsh, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland

Why it matters

Standard RL methods for diffusion models rely on noisy heuristics because diffusion likelihoods are intractable. This work treats denoising as a Markov decision process to provide a principled, stepwise RL framework, achieving state-of-the-art results in math and code reasoning for diffusion-based text models.

From the abstract

Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising steps …
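
To make the core idea concrete, here is a minimal single-sequence REINFORCE sketch for a masked-diffusion LM, assuming each denoising step unmasks a few positions via a per-token categorical distribution. The names `model`, `reward_fn`, `MASK_ID`, and the uniform-random reveal schedule are illustrative placeholders, not the authors' implementation; the point the paper makes is that while the full sequence likelihood is intractable, each denoising transition is tractable, so the trajectory log-likelihood factorizes across steps and reward-weighting that sum gives an unbiased policy gradient.

```python
import torch

MASK_ID = 0  # hypothetical id of the mask token

def stepwise_pg_loss(model, x_T, timesteps, reveal_counts, reward_fn):
    """Single-sequence REINFORCE estimate over a denoising trajectory.

    Each denoising transition is a tractable categorical distribution,
    so the trajectory log-likelihood factorizes as
    sum_t log pi_theta(x_{t-1} | x_t); weighting that sum by the
    terminal reward yields an unbiased policy-gradient estimate with
    no sequence-level likelihood approximation.
    """
    x_t = x_T.clone()                      # (L,) token ids, fully masked
    log_prob_sum = torch.zeros((), device=x_T.device)
    for t, n_reveal in zip(timesteps, reveal_counts):
        logits = model(x_t, t)             # (L, V) per-position logits
        dist = torch.distributions.Categorical(logits=logits)
        sampled = dist.sample()            # (L,) candidate tokens
        masked = (x_t == MASK_ID).nonzero(as_tuple=True)[0]
        # choose which still-masked positions to reveal this step
        # (uniform-random schedule here, purely for illustration)
        perm = torch.randperm(masked.numel(), device=x_T.device)
        reveal = masked[perm[:n_reveal]]
        # only the revealed positions contribute to this step's log-prob
        log_prob_sum = log_prob_sum + dist.log_prob(sampled)[reveal].sum()
        x_t[reveal] = sampled[reveal]      # commit revealed tokens
    reward = float(reward_fn(x_t))         # scalar, non-differentiable
    return -reward * log_prob_sum          # minimize => ascend E[R]
```

A practical trainer would batch sequences and subtract a baseline from the reward to reduce the variance of this estimator; the sketch omits both for clarity.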