AI & ML Efficiency Breakthrough

Hybrid Distillation Policy Optimization (HDPO) overcomes the 'vanishing gradient' problem for hard mathematical prompts that RL agents cannot solve.

March 26, 2026

Original Paper

HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation

Ken Ding

arXiv · 2603.23871

The Takeaway

By using privileged self-distillation on 'cliff prompts' where RL rollouts usually fail, HDPO provides a learning signal where none previously existed. This significantly improves coverage and exploration efficiency in training reasoning models like Qwen-Math or Llama-Math.

From the abstract

Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all - "cliff" prompts - the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail …
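
The routing described in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: all function and variable names are invented, and the group-mean advantage follows common GRPO-style practice rather than anything HDPO specifies.

```python
# Hypothetical sketch of HDPO's cliff-prompt routing (names are illustrative,
# not from the paper). When every rollout for a prompt fails, group-relative
# advantages are all zero and the policy gradient vanishes; the sketch falls
# back to self-distillation on a privileged target for those prompts.

def group_advantages(rewards):
    """GRPO-style advantages: each reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def hdpo_step(prompt, rollout_rewards):
    """Route a prompt to either the RL loss or the distillation loss."""
    if all(r == 0 for r in rollout_rewards):
        # "Cliff" prompt: no rollout succeeded, so every advantage is zero
        # and the RL gradient carries no learning signal. Train instead on
        # a privileged reference solution via self-distillation.
        return ("distill", prompt)
    return ("rl", group_advantages(rollout_rewards))

mode, _ = hdpo_step("hard integral", [0, 0, 0, 0])   # all rollouts fail
easy_mode, advs = hdpo_step("easy sum", [1, 0, 1, 0])
```

Here the first prompt is routed to distillation (`mode == "distill"`) because every rollout reward is zero, while the mixed-outcome prompt keeps its ordinary RL update with centered advantages.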