Provides a statistically rigorous framework for evaluating model performance and reliability even when models have been selected, or cherry-picked, using the same test data.
March 25, 2026
Original Paper
Post-Selection Distributional Model Evaluation
arXiv · 2603.23055
The Takeaway
It addresses 'post-selection bias': the inflation that occurs when practitioners evaluate many models on the same test data and report only the best. By using e-values, the method yields valid distributional performance estimates (such as KPI trade-off curves) even after data-dependent model selection, making LLM and system evaluations considerably more honest.
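To make the e-value idea concrete, here is a minimal Python sketch of the general mechanism, not the paper's specific construction. It assumes per-example losses bounded in [0, 1]; the function name `evalue_kpi_test`, the candidate models, and the bet size `lam` are all hypothetical. The point is the closure property: the average of e-values is itself an e-value, so Markov's inequality still controls the error probability even though every candidate was scored on the same test set.

```python
import numpy as np

def evalue_kpi_test(losses, target, lam=0.5):
    """Betting-style e-value for H0: mean loss >= target.

    Assumes losses lie in [0, 1]. Under H0, each factor
    1 + lam * (target - loss) has expectation <= 1, so the product
    has expectation <= 1: the defining property of an e-value.
    """
    losses = np.asarray(losses, dtype=float)
    # lam must keep every factor nonnegative: lam <= 1 / (1 - target)
    assert 0.0 < lam <= 1.0 / (1.0 - target)
    return float(np.prod(1.0 + lam * (target - losses)))

rng = np.random.default_rng(0)
# Hypothetical per-example losses for three candidate models
models = {name: rng.uniform(0.0, scale, size=200)
          for name, scale in [("A", 0.4), ("B", 0.5), ("C", 0.6)]}

target, alpha = 0.35, 0.05
evals = {name: evalue_kpi_test(l, target) for name, l in models.items()}

# Averaging e-values gives another e-value, so Markov's inequality
# P(avg_e >= 1/alpha) <= alpha holds under the global null even
# though all candidates were scored on the same test set.
avg_e = float(np.mean(list(evals.values())))
if avg_e >= 1.0 / alpha:
    print(f"certified at level {alpha}: some model's mean loss < {target}")
```

A single fixed `lam` is used here for simplicity; betting-based constructions typically tune it adaptively from past observations so the e-value grows faster when the null is false.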
From the abstract
Formal model evaluation methods typically certify that a model satisfies a prescribed target key performance indicator (KPI) level. However, in many applications, the relevant target KPI level may not be known a priori, and the user may instead wish to compare candidate models by analyzing the full trade-offs between performance and reliability achievable at test time by the models. This task, which requires reliable estimation of the test-time KPI distributions, is made more complicated by the fact…
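As a rough illustration of the trade-off analysis the abstract describes, the hypothetical test above can be inverted over a grid of target KPI levels (reusing `evalue_kpi_test` and `models` from the earlier sketch). This yields an upper confidence bound on each model's mean loss, a simplified stand-in for the full test-time KPI distribution estimates the paper targets.

```python
def mean_loss_upper_bound(losses, alpha=0.05, grid=None):
    """Invert the e-value test over a grid of target KPI levels.

    The e-value is nondecreasing in the target (each betting factor
    grows with it), so the smallest certified level on the grid is a
    valid (1 - alpha) upper confidence bound on the mean loss.
    """
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    for t in grid:
        if evalue_kpi_test(losses, t) >= 1.0 / alpha:
            return float(t)
    return 1.0  # nothing certified on this grid

for name, losses in models.items():
    print(name, "mean loss <", mean_loss_upper_bound(losses), "(95% conf.)")
```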