AI & ML Efficiency Breakthrough

PACED introduces a weight kernel that focuses distillation on the 'Zone of Proximal Development,' where the student's gradient signal-to-noise ratio is highest.

arXiv · March 13, 2026 · 2603.11178

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang

Why it matters

Standard distillation wastes compute on tasks that are either too easy or too hard for the student. This paper provides a theoretical and practical framework to automatically prioritize the most informative training samples, leading to significantly better reasoning performance and less 'forgetting' during model compression.

From the abstract

Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not merely intuitive but structurally inevitable: the gradient signal-to-noise ratio in distillation provably vanishes at both pass-rate extremes. This theoretical observation leads to PACED, a framework that concentrates distillation on the zone of proximal development.
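The core idea, that sample weight should vanish at both pass-rate extremes and peak in between, can be sketched with a toy kernel. This is an illustrative guess, not the paper's actual weighting function: the function name `zpd_weight` and the specific form `p * (1 - p)` are assumptions chosen only because they satisfy the property described in the abstract.

```python
import numpy as np

def zpd_weight(pass_rate: np.ndarray) -> np.ndarray:
    """Hypothetical per-problem weight from the student's empirical pass rate.

    Motivated by the claim that gradient SNR vanishes at both pass-rate
    extremes: p * (1 - p) is zero at p = 0 and p = 1 and peaks at p = 0.5,
    a crude stand-in for the 'zone of proximal development'.
    """
    p = np.clip(pass_rate, 0.0, 1.0)
    return p * (1.0 - p)

# Problems the student always solves (p=1) or never solves (p=0) get zero
# weight; a half-solved problem gets the maximum weight of 0.25.
pass_rates = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
weights = zpd_weight(pass_rates)
```

Any kernel with the same shape (zero at the extremes, peaked mid-range) would express the same prioritization; the paper presumably derives its specific form from the SNR analysis.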