AI & ML Paradigm Shift

Fixes the inherent instability of on-policy distillation in LLMs using local support matching and top-p rollout sampling.

March 27, 2026

Original Paper

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, Dongbin Zhao

arXiv · 2603.25562

The Takeaway

On-policy distillation is a key technique for post-training on math and agentic tasks, but it often fails due to "rollout drift," where student rollouts wander away from prefixes the teacher commonly visits. This paper provides the theoretical analysis and practical fixes needed to make sequence-level distillation stable for large-scale deployment.

From the abstract

On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. […]
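To make the contrast concrete, here is a minimal NumPy sketch of one plausible reading of the fixes named in the title: instead of the one-token signal of sampled-token OPD, the per-step loss matches the full student distribution against the teacher's, restricted to the teacher's top-p support. The function names, the choice of reverse KL, and the renormalization over the support set are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def top_p_support(probs, p=0.9):
    """Indices of the smallest set of tokens whose total mass reaches p."""
    order = np.argsort(probs)[::-1]          # tokens sorted by probability
    mass = np.cumsum(probs[order])
    k = int(np.searchsorted(mass, p)) + 1    # smallest prefix covering p
    return order[:k]

def local_support_kl(student_logits, teacher_logits, p=0.9):
    """Reverse KL between student and teacher distributions,
    restricted to the teacher's top-p support and renormalized there
    (an illustrative stand-in for 'local support matching')."""
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    supp = top_p_support(t, p)
    s_loc = s[supp] / s[supp].sum()
    t_loc = t[supp] / t[supp].sum()
    return float(np.sum(s_loc * np.log(s_loc / t_loc)))
```

Compared with scoring only the single sampled token, this per-step loss uses every token in the teacher's high-probability region, while ignoring the low-probability tail where teacher estimates are noisy.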