AI & ML Efficiency Breakthrough

Reduces the compute cost of LLM test-time scaling by up to 67%, using conformal prediction to calibrate reasoning paths.

April 2, 2026

Original Paper

Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning

Cai Zhou, Zekai Wang, Menghua Wu, Qianyu Julie Zhu, Flora C. Shi, Chenyu Wang, Ashia Wilson, Tommi Jaakkola, Stephen Bates

arXiv · 2604.01170

The Takeaway

ORCA lets models stop reasoning as soon as a calibrated confidence threshold is reached, drastically improving the efficiency of expensive 'thinking' models while maintaining rigorous theoretical error bounds.
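
To make the stopping rule concrete, here is a minimal sketch assuming a per-path confidence scorer and a precomputed threshold `tau`. The names `sample_path` and `score_path` are hypothetical interfaces; this illustrates the general idea of threshold-based early stopping, not ORCA's actual algorithm.

```python
# Minimal sketch of threshold-based early stopping (illustrative,
# not ORCA's actual algorithm). `sample_path`, `score_path`, and
# `tau` are assumed interfaces, not the paper's API.

def generate_until_confident(prompt, sample_path, score_path, tau, max_paths=32):
    """Draw reasoning paths one at a time and stop as soon as the best
    path's confidence clears the calibrated threshold `tau`.

    sample_path(prompt) -> one sampled reasoning path / answer
    score_path(prompt, answer) -> confidence score (higher = more confident)
    """
    best_answer, best_score = None, float("-inf")
    paths_used = 0
    for _ in range(max_paths):
        answer = sample_path(prompt)        # one expensive "thinking" rollout
        score = score_path(prompt, answer)  # cheap confidence estimate
        paths_used += 1
        if score > best_score:
            best_answer, best_score = answer, score
        if best_score >= tau:               # calibrated stopping rule:
            break                           # remaining rollouts are skipped
    return best_answer, best_score, paths_used
```

The savings come from `paths_used` typically being far below `max_paths` on easy questions, while hard questions still receive the full budget.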

From the abstract

While test-time scaling has enabled large language models to solve highly difficult tasks, state-of-the-art results come at exorbitant compute costs. These inefficiencies can be attributed to the miscalibration of post-trained language models, and the lack of calibration in popular sampling techniques. Here, we present Online Reasoning Calibration (ORCA), a framework for calibrating the sampling process that draws upon conformal prediction and test-time training. […]
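
For the conformal-prediction side, the standard split-conformal recipe shows how such a threshold can be calibrated with a finite-sample guarantee. This is a generic sketch of that recipe, not the paper's ORCA procedure; `conformal_threshold`, `alpha`, and the score-of-wrong-answers nonconformity measure are illustrative assumptions.

```python
import numpy as np

# Generic split conformal calibration (standard recipe, not the
# paper's ORCA procedure): choose the threshold so that, on a
# held-out calibration set, wrong answers rarely clear it.

def conformal_threshold(cal_scores, cal_correct, alpha=0.1):
    """Pick `tau` so that, under exchangeability, a *wrong* answer
    exceeds the threshold with probability at most `alpha`.

    cal_scores: confidence scores on held-out calibration questions
    cal_correct: boolean flags for whether each answer was correct
    """
    scores = np.asarray(cal_scores, dtype=float)
    correct = np.asarray(cal_correct, dtype=bool)
    wrong_scores = scores[~correct]  # nonconformity: confident-but-wrong
    n = len(wrong_scores)
    if n == 0:
        raise ValueError("need at least one incorrect calibration example")
    # Finite-sample-corrected quantile level from split conformal prediction
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(wrong_scores, q, method="higher")
```

The resulting `tau` is what a stopping rule like the sketch above would consume; the test-time-training component of ORCA, which adapts the calibration online, is beyond this illustration.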