AI & ML Efficiency Breakthrough

Challenges the dominance of on-policy RL for LLMs by introducing a practical off-policy value-based framework that enables data reuse.

March 25, 2026

Original Paper

Off-Policy Value-Based Reinforcement Learning for Large Language Models

Peng-Yuan Wang, Ziniu Li, Tian Xu, Bohan Yang, Tian-Shuo Liu, ChenYang Wang, Xiong-Hui Chen, Yi-Chen Li, Tianyun Yang, Congliang Chen, Yang Yu

arXiv · 2603.23355

The Takeaway

Current RL methods for LLMs (PPO, GRPO) are sample-inefficient because they discard each batch of trajectories after a single update. This work demonstrates that Bellman-update-based off-policy learning (ReVal) can outperform on-policy methods while significantly improving data utilization and convergence speed.

From the abstract

Improving data utilization efficiency is critical for scaling reinforcement learning (RL) for long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update each batch of data only once, discard it, and then collect fresh samples, resulting in poor sample efficiency. In this work, we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based…