AI & ML New Capability

Enables VideoLLMs to perform complex logical reasoning simultaneously with video playback without incurring the latency of standard test-time scaling.

arXiv · March 13, 2026 · 2603.12262

Yiran Guan, Liang Yin, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai

Why it matters

Existing VideoLLMs struggle with the trade-off between reasoning depth and real-time responsiveness. This paper introduces a 'thinking while watching' mechanism that amortizes reasoning latency over the video stream, allowing for sophisticated, multi-turn interaction in real-time online environments.
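The amortization idea can be illustrated with a toy sketch. This is a hypothetical simulation, not the paper's implementation: a small per-frame reasoning budget is spent while frames stream in, so that when the user's query arrives only a short final synthesis remains, instead of a long post-hoc reasoning chain. The function names (`stream_frames`, `think_while_watching`) and the per-frame step budget are illustrative assumptions.

```python
from collections import deque

def stream_frames(n):
    """Toy stand-in for a video stream: yields frame indices."""
    for i in range(n):
        yield i

def think_while_watching(frames, steps_per_frame=2):
    """Hypothetical sketch of 'thinking while watching': spend a fixed
    reasoning budget as each frame is ingested, amortizing latency over
    the stream rather than paying it all at query time."""
    reasoning_trace = deque()
    for frame in frames:
        # Perceive the incoming frame (placeholder for the perception stream).
        observation = f"frame-{frame}"
        # Spend a small reasoning budget per frame instead of all at the end.
        for step in range(steps_per_frame):
            reasoning_trace.append(f"thought({observation}, step {step})")
    # By the time a query arrives, most of the trace already exists;
    # only a short final synthesis over the trace remains.
    return len(reasoning_trace), reasoning_trace[-1]

total_steps, last_thought = think_while_watching(stream_frames(5))
```

The contrast with standard test-time scaling is the placement of the inner loop: here it runs during playback, whereas a post-hoc approach would run all `n * steps_per_frame` steps after the query, incurring the full latency at once.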

From the abstract

Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream; directly applying test-time scaling methods, however, incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking while watching mechanism, which activates reasoning …