AI & ML: Soft Labels Break a Core ECE Assumption

Identifies a structural flaw in the standard Expected Calibration Error (ECE) when applied to soft labels and introduces SMECE to fix it.

arXiv · March 17, 2026 · 2603.14092

Michael Leznik

The Takeaway

Many modern ML tasks use soft labels (e.g., knowledge distillation from teacher models), yet practitioners still apply binary ECE, which yields wrong answers even in the infinite-data limit. SMECE is a strict generalization that keeps calibration metrics valid for the probabilistic labels common in advanced training pipelines.
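To see the structural flaw concretely, here is a minimal sketch (assumed illustration, not the paper's SMECE definition): a predictor that is perfectly calibrated against soft labels still registers large error under standard binned ECE once the labels are thresholded to binary, and the error does not shrink with more data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_bins = 100_000, 10

# Soft labels: each label is itself a probability (e.g., a teacher
# model's output). The predictor is perfectly calibrated: f = q.
q = rng.uniform(0.0, 1.0, size=n)   # ground-truth soft labels
f = q.copy()                        # predictions match them exactly

def binned_ece(preds, targets, n_bins=10):
    """Standard equal-width binned Expected Calibration Error."""
    bins = np.clip((preds * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(preds[mask].mean() - targets[mask].mean())
            ece += mask.mean() * gap
    return ece

# Common (flawed) practice: binarize the soft labels, then compute ECE.
hard = (q > 0.5).astype(float)
print(f"binary ECE vs. thresholded labels: {binned_ece(f, hard):.3f}")  # ~0.25

# Scoring against the soft labels directly, the miscalibration vanishes.
print(f"ECE vs. soft labels directly:      {binned_ece(f, q):.3f}")     # 0.000
```

The ~0.25 gap is irreducible: it comes from thresholding, not sampling noise, which is the "wrong answers even with infinite data" failure the paper targets.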

From the abstract

The Expected Calibration Error (ECE), the dominant calibration metric in machine learning, compares predicted probabilities against empirical frequencies of binary outcomes. This is appropriate when labels are binary events. However, many modern settings produce labels that are themselves probabilities rather than binary outcomes: a radiologist's stated confidence, a teacher model's soft output in knowledge distillation, a class posterior derived from a generative model, or an annotator agreement…