Enables VideoLLMs to perform complex logical reasoning simultaneously with video playback without incurring the latency of standard test-time scaling.
arXiv · March 13, 2026 · 2603.12262
Why it matters
Existing VideoLLMs struggle with the trade-off between reasoning depth and real-time responsiveness. This paper introduces a 'thinking while watching' mechanism that amortizes reasoning latency over the video stream, enabling sophisticated, multi-turn interaction in online streaming settings.
From the abstract
Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking while watching mechanism, which activates reasoning …
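The core idea, amortizing reasoning over the stream instead of deferring it to query time, can be illustrated with a minimal sketch. All names here (`perceive`, `think_step`, the fixed per-frame thinking budget) are hypothetical stand-ins; the paper's actual VST implementation is not described in this summary.

```python
# Hypothetical sketch of a "thinking while watching" loop: a small, bounded
# reasoning budget is spent after each incoming frame, so latency is spread
# across the stream rather than paid all at once when a query arrives.
from collections import deque

def stream_frames(n):
    """Stand-in for a live video stream of n frames."""
    for t in range(n):
        yield {"t": t, "feat": f"frame-{t}"}

def perceive(frame, memory):
    """Streaming perception: fold the new frame into working memory."""
    memory.append(frame["feat"])

def think_step(scratchpad, memory):
    """One bounded reasoning step, run between frame arrivals."""
    scratchpad.append(f"infer({memory[-1]})")

def thinking_while_watching(n_frames, steps_per_frame=2):
    memory, scratchpad = deque(maxlen=16), []
    for frame in stream_frames(n_frames):
        perceive(frame, memory)
        # Amortized reasoning: a fixed thinking budget per frame, instead
        # of a long test-time-scaling chain deferred to query time.
        for _ in range(steps_per_frame):
            think_step(scratchpad, memory)
    # At query time, the answer can draw on this accumulated reasoning
    # trace, keeping response latency low.
    return scratchpad

trace = thinking_while_watching(4)
print(len(trace))  # 4 frames x 2 steps per frame = 8 reasoning steps
```

The design choice this toy loop captures is that per-frame reasoning cost is bounded, so responsiveness does not degrade as the chain of thought grows.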