AI & ML Efficiency Breakthrough

REOPOLD achieves 10x better sample efficiency in reasoning distillation, enabling 7B student models to match 32B teachers with an order of magnitude less training data.

arXiv · March 13, 2026 · 2603.11137

Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, Pashmina Cameron

Why it matters

By interpreting on-policy distillation as policy optimization and applying reward clipping and dynamic sampling, this framework stabilizes the transfer of reasoning capabilities. In particular, it addresses the 'negative transfer' problem, in which a student fails to learn from a teacher that is too far ahead in capability.
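A minimal sketch of how that token reward and the two stabilizers might look, assuming per-token log-probabilities of student-sampled tokens under both models. The function names, the clip bound of 2.0, and the mean-|reward| filter are illustrative choices, not the paper's implementation:

```python
import torch

def token_rewards(teacher_logprobs: torch.Tensor,
                  student_logprobs: torch.Tensor,
                  clip: float = 2.0) -> torch.Tensor:
    # Teacher-student log-likelihood ratio as a per-token reward,
    # clipped to a fixed band so a far-stronger teacher cannot blow
    # up the update (one reading of "reward clipping").
    raw = teacher_logprobs - student_logprobs
    return raw.clamp(min=-clip, max=clip)

def keep_sequence(rewards: torch.Tensor, eps: float = 1e-3) -> bool:
    # Toy "dynamic sampling" filter: skip sequences whose clipped
    # rewards carry essentially no gradient signal, e.g. when the
    # teacher and student already agree everywhere.
    return rewards.abs().mean().item() > eps

# Example: log-probs of tokens the student sampled, scored by each model.
t_lp = torch.tensor([-0.4, -1.1, -0.2])  # teacher log p(y_t | x, y_<t)
s_lp = torch.tensor([-2.6, -1.0, -0.3])  # student log p(y_t | x, y_<t)
r = token_rewards(t_lp, s_lp)            # tensor([ 2.0000, -0.1000, 0.1000])
print(r, keep_sequence(r))               # first token's reward hit the clip
```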

From the abstract

On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet it remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization, where the teacher-student log-likelihood ratio acts as a token reward. From this insight, we introduce REOPOLD (Relaxed On-Policy Distillation), a framework that stabilizes optimization by relaxing the strict imitation …
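On this reading, one plausible form of the per-token reward and the student update is the following (our notation, not necessarily the paper's: $\pi_T$ the teacher, $\pi_S$ the student parameterized by $\theta$, $c$ a clipping bound, $y_t$ a student-sampled token):

$r_t = \operatorname{clip}\big(\log \pi_T(y_t \mid x, y_{<t}) - \log \pi_S(y_t \mid x, y_{<t}),\ -c,\ c\big)$

$\nabla_\theta J \approx \mathbb{E}\Big[\textstyle\sum_t r_t\, \nabla_\theta \log \pi_S(y_t \mid x, y_{<t})\Big]$

That is, a REINFORCE-style update in which the student is rewarded token by token for moving toward the teacher, with clipping bounding the influence of tokens where the two models diverge sharply.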