Distilled datasets often fail to beat random image selection once the soft label trick is removed from the equation.
April 24, 2026
Original Paper
Rethinking Dataset Distillation: Hard Truths about Soft Labels
arXiv · 2604.18811
The Takeaway
For years, synthetic-data benchmarks have credited specialized distillation methods with breakthrough performance, and researchers have spent immense resources trying to compress massive datasets into a few dozen synthetic images. This evaluation reveals that the apparent success was an artifact of how training labels were represented, not of algorithmic progress: when tested with standard hard labels, the distilled datasets performed no better than a random sample of the original data. This discovery forces a reset of the entire field of dataset distillation.
From the abstract
Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence finds that simple random image baselines perform on-par with state-of-the-art DD methods like SRe2L due to the use of soft labels during downstream model training. This is in contrast with the findings in coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine …
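To make the "soft label trick" concrete, here is a minimal sketch (not from the paper; the class counts and probability values are made up for illustration) contrasting the two training signals. A hard label is a one-hot vector, so the cross-entropy loss depends only on the true class; a soft label is a teacher model's full probability vector, so every class contributes and the label itself carries extra information about inter-class similarity.

```python
import numpy as np

def cross_entropy(student_probs, target_probs):
    """Cross-entropy of the student's predictions against a target distribution."""
    return -np.sum(target_probs * np.log(student_probs))

# Student model's predicted probabilities for one image over 3 classes.
student = np.array([0.6, 0.3, 0.1])

# Hard label: one-hot vector for the ground-truth class.
hard = np.array([1.0, 0.0, 0.0])

# Soft label: a teacher's probability vector, which also encodes
# how similar the image looks to the other classes.
soft = np.array([0.7, 0.2, 0.1])

hard_loss = cross_entropy(student, hard)  # only the true-class term survives
soft_loss = cross_entropy(student, soft)  # all classes contribute

print(round(hard_loss, 4))  # → 0.5108
print(round(soft_loss, 4))  # → 0.8286
```

The paper's point is that this richer soft-label signal, not the synthetic images themselves, was doing the heavy lifting in distillation benchmarks; swap in hard labels and the advantage over random image subsets disappears.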