Simple training methods from years ago outperform modern, complex techniques once you control for training compute.
April 29, 2026
Original Paper
The Surprising Effectiveness of Canonical Knowledge Distillation for Semantic Segmentation
arXiv · 2604.25530
The Takeaway
Semantic segmentation has recently seen a flood of increasingly complicated knowledge distillation methods. This paper shows that basic, canonical distillation is actually superior once methods are compared under equal training-time budgets: much of the reported progress in the field was a byproduct of spending more compute, not of better algorithms. The finding suggests that researchers are over-engineering solutions and overlooking the strength of simple baselines, and that practitioners should think twice before adopting complex pipelines when a well-tuned canonical recipe achieves better results for less effort.
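For context, "canonical" distillation here typically refers to the original logit-matching recipe: the student is trained with standard cross-entropy on the ground-truth mask plus a pixel-wise KL term toward the temperature-softened teacher predictions. Below is a minimal sketch of that loss, assuming PyTorch tensors of shape (N, C, H, W); the function and parameter names are illustrative, not taken from the paper.

```python
# Sketch of canonical (logit-matching) knowledge distillation for semantic
# segmentation. Names and hyperparameter values are illustrative assumptions.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Cross-entropy on ground truth plus pixel-wise KL toward the teacher.

    student_logits, teacher_logits: (N, C, H, W) class logits per pixel
    labels: (N, H, W) integer class indices, 255 = ignore
    """
    # Supervised term: ordinary cross-entropy against the ground-truth mask.
    ce = F.cross_entropy(student_logits, labels, ignore_index=255)

    # Distillation term: KL divergence between temperature-softened teacher
    # and student distributions, computed independently at every pixel.
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=1)
    p_teacher = F.softmax(teacher_logits / t, dim=1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

    return alpha * ce + (1.0 - alpha) * kl
```

The point of the paper's comparison is that this single, cheap objective, tuned well and given the same training budget, matches or beats the more elaborate hand-crafted alternatives.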
From the abstract
Recent knowledge distillation (KD) methods for semantic segmentation introduce increasingly complex hand-crafted objectives, yet are typically evaluated under fixed iteration schedules. These objectives substantially increase per-iteration cost, meaning equal iteration counts do not correspond to equal training budgets. It is therefore unclear whether reported gains reflect stronger distillation signals or simply greater compute. We show that iteration-based comparisons are misleading: when wall-clock […]
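The budgeting issue the abstract raises is easy to state in code: if two methods differ in per-step cost, fixing the iteration count silently gives the costlier one more compute, whereas fixing wall-clock time does not. A hypothetical sketch of a time-budgeted training loop (train_for_budget and step_fn are illustrative names, not from the paper):

```python
# Illustrative sketch only: compare methods under an equal wall-clock budget
# rather than an equal iteration count.
import time

def train_for_budget(step_fn, budget_seconds):
    """Run training steps until the wall-clock budget is exhausted."""
    start, iterations = time.monotonic(), 0
    while time.monotonic() - start < budget_seconds:
        step_fn()          # one optimizer step; its cost depends on the KD objective
        iterations += 1
    # A cheaper objective completes more steps within the same budget,
    # which is exactly what fixed-iteration comparisons hide.
    return iterations
```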