AI & ML Efficiency Breakthrough

Provides a mathematically grounded, efficient offline policy optimization method for Diffusion LLMs by estimating trajectory probabilities with a single forward pass.

March 20, 2026

Original Paper

dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models

Wenxuan Zhang, Lemeng Wu, Changsheng Zhao, Ernie Chang, Mingchen Zhuge, Zechun Liu, Andy Su, Hanxian Huang, Jun Chen, Chong Zhou, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Wei Wen

arXiv · 2603.18806

The Takeaway

Aligning Diffusion LLMs with human preferences has previously been computationally prohibitive, largely because computing full denoising-trajectory probabilities requires many forward passes. dTRPO reduces this cost by re-masking the final state and using a single forward pass to estimate the trajectory probability ratio, enabling scaled-up offline training that yields significant gains in reasoning and coding tasks.
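To make the idea concrete, here is a minimal sketch (not the authors' implementation) of the single-forward-pass estimate described above: re-mask the completed sequence once, run one forward pass under the policy and the reference model, and read off the log-probability ratio of the newly unmasked tokens. The `ToyDenoiser`, the mask token id, vocabulary size, and masking rate are illustrative assumptions standing in for a real dLLM.

```python
import torch
import torch.nn as nn

MASK_ID = 0
VOCAB = 32


class ToyDenoiser(nn.Module):
    """Stand-in for a masked-diffusion LLM: maps token ids to per-position logits."""

    def __init__(self, vocab: int = VOCAB, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.emb(tokens))  # (batch, seq, vocab)


def single_pass_log_ratio(policy, reference, final_tokens, mask_rate=0.5):
    """Estimate the trajectory log-probability ratio from one re-masked forward pass."""
    # Re-mask a random subset of positions in the completed sequence.
    remask = torch.rand_like(final_tokens, dtype=torch.float) < mask_rate
    noisy = final_tokens.masked_fill(remask, MASK_ID)

    # One forward pass per model on the re-masked state.
    logp_pi = policy(noisy).log_softmax(-1)
    logp_ref = reference(noisy).log_softmax(-1)

    # Per-token log-probability ratios at the positions that would be unmasked.
    tgt = final_tokens.unsqueeze(-1)
    tok_ratio = (logp_pi.gather(-1, tgt) - logp_ref.gather(-1, tgt)).squeeze(-1)

    # Sum only over the re-masked positions: the "newly unmasked tokens" in the paper's claim.
    return (tok_ratio * remask.float()).sum(dim=-1)


if __name__ == "__main__":
    torch.manual_seed(0)
    policy, reference = ToyDenoiser(), ToyDenoiser()
    final_tokens = torch.randint(1, VOCAB, (2, 16))  # two completed sequences
    print(single_pass_log_ratio(policy, reference, final_tokens))
```

In this sketch the ratio for the re-masked tokens plays the role that the full trajectory ratio would play in a reference-regularized policy objective; how the estimate is plugged into the actual dTRPO loss is detailed in the paper.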

From the abstract

Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation, which in turn presents new challenges for aligning them with human preferences. In this work, we aim to improve the policy optimization for dLLMs by reducing the cost of the trajectory probability calculation, thereby enabling scaled-up offline policy training. We prove that: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens is an unbiased estimate of that of inte…