Hybrid Distillation Policy Optimization (HDPO) overcomes the 'vanishing gradient' problem for hard mathematical prompts that RL agents cannot solve.
March 26, 2026
Original Paper
HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation
arXiv · 2603.23871
The Takeaway
By using privileged self-distillation on 'cliff prompts' where RL rollouts usually fail, HDPO provides a learning signal where none previously existed. This significantly improves coverage and exploration efficiency when training reasoning models such as Qwen-Math or Llama-Math.
From the abstract
Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all - "cliff" prompts - the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail …
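The mechanism the abstract describes can be sketched as loss routing: when every rollout on a prompt fails, the group-centered advantages are all zero, the policy gradient carries no signal, and HDPO substitutes a distillation loss instead. A minimal sketch, assuming binary rewards (1 = correct, 0 = fail); the function names and the routing details here are illustrative, not the paper's exact formulation:

```python
def group_advantages(rewards):
    """Centered advantages, as in group-based policy-gradient methods:
    each rollout's reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def route_loss(rollout_rewards):
    """Decide which loss to apply for one prompt's batch of rollouts.

    On a 'cliff' prompt every rollout fails, so every centered advantage
    is zero and the RL gradient vanishes. HDPO instead distills from a
    privileged teacher (e.g. the model conditioned on extra information
    such as the reference solution).
    """
    if all(r == 0 for r in rollout_rewards):
        return "privileged_distillation"
    return "rl_policy_gradient"
```

For example, a prompt with rewards `[0, 0, 0, 0]` yields all-zero advantages and is routed to distillation, while `[0, 1, 0]` retains a nonzero RL signal and is trained with the standard objective.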