AI & ML Paradigm Shift

Demonstrates that perplexity/log-likelihood is a deceptive metric for model distillation, often masking massive drops in actual generation quality.

March 30, 2026

Original Paper

When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

Juan Gabriel Kostelec, Xiang Wang, Axel Laborieux, Christos Sourmpis, Qinghai Guo

arXiv · 2603.26556

The Takeaway

The authors show that models with nearly identical perplexity can differ by over 20% in autoregressive generation accuracy. For practitioners building smaller, efficient LLMs, this forces a shift toward generation-based evaluation during distillation.
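The distinction driving this result can be made concrete with a toy sketch (all token names and probabilities below are made up for illustration, not taken from the paper). Likelihood-based ranking scores each fixed candidate answer in full, while generation-based evaluation checks what the model actually emits step by step; the two can disagree whenever a candidate's first token is not the model's greedy pick, even if the candidate's total likelihood is highest.

```python
import math

# Hypothetical hand-coded next-token distributions standing in for a model.
# Keyed by the prefix of tokens generated so far.
NEXT = {
    (): {"ro": 0.40, "pa": 0.35, "xx": 0.25},
    ("ro",): {"me": 0.60, "xx": 0.40},
    ("pa",): {"ris": 0.90, "xx": 0.10},
}

def seq_logprob(tokens):
    """Total log-likelihood of a candidate answer, as used when ranking
    multiple-choice options without generating anything."""
    lp, prefix = 0.0, ()
    for t in tokens:
        lp += math.log(NEXT[prefix][t])
        prefix += (t,)
    return lp

def greedy_generate(max_tokens=2):
    """What the model actually emits under greedy autoregressive decoding."""
    prefix = ()
    for _ in range(max_tokens):
        dist = NEXT[prefix]
        prefix += (max(dist, key=dist.get),)
        if prefix not in NEXT:
            break
    return prefix

candidates = [("pa", "ris"), ("ro", "me")]
ranked_pick = max(candidates, key=seq_logprob)   # ("pa", "ris"): 0.35*0.90 beats 0.40*0.60
generated = greedy_generate()                    # ("ro", "me"): "ro" wins the first step
```

Here likelihood ranking would credit the model with the answer `("pa", "ris")`, while the model's own generation produces `("ro", "me")`; scaled up across a benchmark, this is the kind of gap the paper argues perplexity-style metrics conceal.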

From the abstract

Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure im…