Enables concurrent perception and reasoning for continuous video streams in Multimodal Large Language Models.
arXiv · March 13, 2026 · 2603.11896
Why it matters
Most MLLMs process video offline or with high latency; this framework introduces a segment-level streaming memory and a causal mask that let the model "think while watching." It addresses the memory-decay and latency problems inherent in long-range video reasoning.
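The core idea — a causal mask that permits reasoning tokens to attend to every video segment perceived so far, so generation need not wait for perception to finish — can be sketched roughly as follows. This is an illustrative toy, not the paper's actual design: the token layout, the `streaming_mask` function, and the `"seg"`/`"txt"` labels are all assumptions.

```python
# Hypothetical sketch of a segment-level streaming attention mask.
# The sequence interleaves video-segment tokens ("seg") and reasoning
# tokens ("txt"), each stamped with the stream time at which it exists.
# A token may attend to (a) any token earlier in the flattened sequence
# (ordinary causal attention) and (b) any video segment whose timestamp
# is <= its own, even if that segment sits later in the sequence --
# so reasoning proceeds concurrently with perception.

def streaming_mask(tokens):
    """tokens: list of (kind, time) pairs; returns a boolean mask
    where mask[i][j] is True iff token i may attend to token j."""
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for i, (kind_i, t_i) in enumerate(tokens):
        for j, (kind_j, t_j) in enumerate(tokens):
            if j <= i:
                mask[i][j] = True          # standard causal order
            elif kind_j == "seg" and t_j <= t_i:
                mask[i][j] = True          # already-perceived segment
    return mask

# A reasoning token at time 1 can see the time-1 segment even though
# that segment appears after it in the flattened sequence; a time-0
# segment can never see the future time-1 segment.
tokens = [("seg", 0), ("txt", 1), ("seg", 1)]
mask = streaming_mask(tokens)
```

Under this toy layout, only the segment-timestamp rule breaks strict causality; text tokens are never visible ahead of their sequence position, which preserves autoregressive generation.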
From the abstract
Multimodal large language models (MLLMs) show strong performance on offline video understanding, but most are limited to offline inference or exhibit weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically adopt an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Th