ThinkStream introduces a 'Watch-Think-Speak' paradigm for video reasoning that lets models incrementally update their understanding of a stream and decide when to respond in real time.
arXiv · March 16, 2026 · 2603.12938
Why it matters
Moves beyond batch processing of video, which is too slow for real-time assistants. Its reasoning-compressed memory replaces raw pixels with semantic traces, significantly lowering latency and memory usage in long-horizon streaming scenarios.
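To make the memory idea concrete, here is a minimal Python sketch of a bounded buffer of semantic traces standing in for raw frames; the names (SemanticTrace, ReasoningCompressedMemory, observe, context) are illustrative assumptions, not the paper's API.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class SemanticTrace:
    timestamp: float   # when the frame was observed (seconds into the stream)
    summary: str       # compact semantic trace, standing in for raw pixels

class ReasoningCompressedMemory:
    """Keeps a bounded buffer of semantic traces instead of raw frames,
    so memory stays flat as the stream grows (hypothetical sketch)."""

    def __init__(self, capacity: int = 256):
        self.traces = deque(maxlen=capacity)  # oldest traces evicted first

    def observe(self, timestamp: float, frame_summary: str) -> None:
        # Store only the compressed semantic trace of the frame.
        self.traces.append(SemanticTrace(timestamp, frame_summary))

    def context(self) -> str:
        # Concatenate traces into a compact context for the reasoner.
        return "\n".join(f"[{t.timestamp:.1f}s] {t.summary}" for t in self.traces)
```

Because the buffer holds short summaries rather than frames, both memory footprint and per-step reasoning cost stay bounded no matter how long the stream runs.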
From the abstract
Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context is observed, resulting in high latency and growing computational cost that are incompatible with streaming scenarios. In this paper, we introduce ThinkStream, a framework for streaming video reasoning based on a Watch-Think-Speak paradigm.
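The paradigm itself can be pictured as a loop over the incoming stream. Below is a hedged sketch, assuming a hypothetical model object with encode_frame, update_reasoning, should_speak, and generate_response methods and reusing the memory sketch above; how ThinkStream actually decides when to speak is not detailed in this excerpt.

```python
def watch_think_speak(stream, memory, model):
    # Watch: consume (timestamp, frame) pairs as they arrive.
    for timestamp, frame in stream:
        summary = model.encode_frame(frame)   # compress pixels into a semantic trace
        memory.observe(timestamp, summary)    # update the compressed memory

        # Think: refresh the reasoning state over the compact context,
        # not the full raw video history.
        thought = model.update_reasoning(memory.context())

        # Speak: emit a response only when the model judges it is time,
        # rather than waiting for the stream to end.
        if model.should_speak(thought):
            yield model.generate_response(thought)
```

The key contrast with batch pipelines is that reasoning and the speak decision happen inside the loop, per frame, instead of once after the whole video has been observed.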