Provides a mathematically grounded, efficient offline policy optimization method for Diffusion LLMs by estimating trajectory probabilities with a single forward pass.
March 20, 2026
Original Paper
dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models
arXiv · 2603.18806
The Takeaway
Aligning Diffusion LLMs with human preferences has so far been computationally prohibitive, because scoring a full denoising trajectory requires many forward passes. dTRPO cuts this cost by re-masking the final generated state and estimating the probability of the full trajectory from a single forward pass, enabling scaled-up offline policy training that yields significant gains on reasoning and coding tasks.
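To make the mechanism concrete, here is a minimal sketch of the single-pass estimate. Everything model-specific is an assumption: `MASK_ID`, the Hugging-Face-style `model(input_ids).logits` interface, the masking ratio `t`, and the omission of whatever masking-ratio weighting the full paper specifies. It is an illustration of the re-masking idea, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # placeholder; use the model's actual [MASK] token id (assumption)

@torch.no_grad()
def trajectory_log_ratio(policy, reference, prompt_ids, response_ids, t=0.5):
    """One-pass estimate of log pi(y|x) - log pi_ref(y|x) for a masked-diffusion LM.

    Re-masks a fraction `t` of the completed response, runs a single forward
    pass per model, and scores only the re-masked ("newly unmasked") tokens.
    """
    # (1) Re-mask a random fraction t of the final (fully generated) response.
    masked = torch.rand(response_ids.shape) < t                 # bool, (1, L_resp)
    noised = response_ids.masked_fill(masked, MASK_ID)
    input_ids = torch.cat([prompt_ids, noised], dim=-1)         # (1, L)

    # (2) A single forward pass per model on the re-masked state
    #     (assumes HF-style models returning .logits of shape (1, L, V)).
    logits = policy(input_ids).logits
    logits_ref = reference(input_ids).logits

    # (3) Log-probs of the ground-truth response tokens, response positions only.
    start = prompt_ids.shape[-1]
    lp = F.log_softmax(logits[:, start:], dim=-1)
    lp_ref = F.log_softmax(logits_ref[:, start:], dim=-1)
    tok_lp = lp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    tok_lp_ref = lp_ref.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)

    # (4) Sum the per-token log-ratio over the re-masked positions only: these
    #     are the "newly unmasked" tokens whose ratio, per the paper's claim,
    #     estimates the full-trajectory ratio (any importance weighting by the
    #     masking ratio that the paper prescribes is omitted here).
    return ((tok_lp - tok_lp_ref) * masked).sum()
```

Scoring only the re-masked positions is the point of the construction: the model's predictions at already-visible tokens play no role, so one forward pass on a single re-masked state stands in for replaying the whole denoising trajectory.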
From the abstract
Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation, which in turn presents new challenges for aligning them with human preferences. In this work, we aim to improve the policy optimization for dLLMs by reducing the cost of the trajectory probability calculation, thereby enabling scaled-up offline policy training. We prove that: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens is an unbiased estimate of that of intermediate trajectories […]
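Read literally, claim (i) says the per-step ratio is an unbiased estimator of the trajectory-level ratio. One way to write this in symbols (the notation below is ours, not the paper's): with $x$ the prompt, $s_t$ a re-masked state at masking ratio $t$, $y_U$ the tokens unmasked when denoising $s_t$, and $\tau$ the full denoising trajectory,

$$
\mathbb{E}_{s_t}\!\left[\frac{\pi_\theta\!\left(y_U \mid x, s_t\right)}{\pi_{\mathrm{ref}}\!\left(y_U \mid x, s_t\right)}\right]
= \frac{\pi_\theta(\tau \mid x)}{\pi_{\mathrm{ref}}(\tau \mid x)},
$$

holding under the reference-policy regularization condition the abstract states. This identity is what lets dTRPO swap a multi-step trajectory rollout for one forward pass on a re-masked final state.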