Introduces a streaming detection head that stops Large Reasoning Models (LRMs) from 'overthinking', i.e. generating redundant reasoning steps after the answer has been reached.
March 24, 2026
Original Paper
ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention
arXiv · 2603.22016
The Takeaway
By monitoring hidden states in real time, ROM cuts response lengths by 47% and improves efficiency by 121% without retraining the backbone. This is a practical solution to the high latency and compute costs currently associated with long Chain-of-Thought models.
From the abstract
Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking: even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy backbone modifications or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, a real-time overthinking mitigation method based on streaming detection and intervention.
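The core idea, as described above, is a lightweight detection head that watches the model's hidden states during decoding and intervenes once the reasoning looks redundant. The sketch below is a minimal illustration of that pattern, not the paper's actual architecture: it assumes a simple logistic probe over the hidden state, and the class name, weights, and the threshold/patience stopping rule are all hypothetical.

```python
import numpy as np

class OverthinkingDetector:
    """Hypothetical streaming probe: scores each decoding step's hidden
    state and signals a stop once the 'answer already reached' confidence
    stays above `threshold` for `patience` consecutive steps."""

    def __init__(self, w, b, threshold=0.9, patience=3):
        self.w = np.asarray(w, dtype=float)  # probe weights, shape (d,)
        self.b = float(b)                    # probe bias
        self.threshold = threshold           # confidence cutoff per step
        self.patience = patience             # consecutive steps before stopping
        self.streak = 0

    def step(self, hidden_state):
        # logistic probe on the current step's hidden state
        p = 1.0 / (1.0 + np.exp(-(self.w @ np.asarray(hidden_state) + self.b)))
        self.streak = self.streak + 1 if p > self.threshold else 0
        return self.streak >= self.patience  # True -> intervene (force stop)

# toy usage with a 4-dim hidden state that the probe fires strongly on
det = OverthinkingDetector(w=[2.0, 0.0, 0.0, 0.0], b=0.0,
                           threshold=0.8, patience=2)
redundant = np.array([3.0, 0.0, 0.0, 0.0])
for t in range(5):
    if det.step(redundant):
        print(f"stop at step {t}")  # fires on the 2nd consecutive hit
        break
```

In a real decoding loop, the `True` signal would typically be wired to the generator's stopping mechanism (e.g. forcing an end-of-sequence token), so the backbone itself never needs retraining.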