AI & ML New Capability

Enables concurrent perception and reasoning for continuous video streams in Multimodal Large Language Models.

arXiv · March 13, 2026 · 2603.11896

Lu Wang, Zhuoran Jin, Yupu Hao, Yubo Chen, Kang Liu, Yulong Ao, Jun Zhao

Why it matters

Most MLLMs process video offline or with high latency; this framework introduces segment-level streaming memory and a causal attention mask that let the model 'think while watching,' addressing the memory decay and latency issues inherent in long-range video reasoning.
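
To make "segment-level streaming memory" concrete, here is a minimal sketch of one plausible design: incoming frame features fill a segment, each full segment is pooled into a single summary vector, and summaries live in a bounded buffer so context stays small as the stream grows. The class name, segment length, pooling, and drop-oldest policy are illustrative assumptions, not the paper's actual mechanism:

```python
import torch


class SegmentStreamingMemory:
    """Illustrative segment-level memory for a continuous video stream.

    NOTE: a hypothetical sketch, not the paper's method. Frame features
    fill the open segment; a full segment is mean-pooled into one summary
    vector and appended to a bounded memory, so the context the LLM
    attends to stays small as the stream grows.
    """

    def __init__(self, seg_len: int = 16, max_segments: int = 64, dim: int = 768):
        self.seg_len = seg_len
        self.max_segments = max_segments
        self.dim = dim
        self.frame_buf: list[torch.Tensor] = []  # frames of the open segment
        self.memory: list[torch.Tensor] = []     # one summary per closed segment

    def add_frame(self, feat: torch.Tensor) -> None:
        """Ingest one frame feature of shape (dim,)."""
        self.frame_buf.append(feat)
        if len(self.frame_buf) == self.seg_len:
            summary = torch.stack(self.frame_buf).mean(dim=0)  # pool the segment
            self.frame_buf.clear()
            self.memory.append(summary)
            if len(self.memory) > self.max_segments:
                self.memory.pop(0)  # drop-oldest; a real system might compress

    def context(self) -> torch.Tensor:
        """Segment summaries available to the model, shape (n_segments, dim)."""
        if not self.memory:
            return torch.empty(0, self.dim)
        return torch.stack(self.memory)
```

A bounded buffer like this trades recall of the distant past for constant memory; the point of the paper's streaming memory is precisely to manage that trade-off better than naive interleaving.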

From the abstract

Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Th…
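
As a way to make "concurrent perception and generation" concrete, the hedged sketch below builds a boolean attention mask from per-token arrival times: a token may attend to anything that arrived no later than it did, so text generated mid-stream sees exactly the video observed so far while new frames keep arriving. The function name and time-stamping scheme are assumptions for illustration, not the paper's published mask:

```python
import torch


def streaming_causal_mask(arrival: torch.Tensor) -> torch.Tensor:
    """Boolean mask (True = may attend) built from token arrival times.

    NOTE: an illustrative assumption, not the paper's exact mask.
    arrival[i] is the stream time at which token i (a video frame token
    or a generated text token) became available; token i may attend to
    token j iff arrival[j] <= arrival[i].
    """
    return arrival.unsqueeze(0) <= arrival.unsqueeze(1)


# Example: frames arrive at t = 0, 1, 2, 3; two answer tokens are
# generated at t = 1.5 and 2.5, i.e. while the stream is still live.
arrival = torch.tensor([0.0, 1.0, 1.5, 2.0, 2.5, 3.0])
mask = streaming_causal_mask(arrival)
# The token generated at t = 1.5 sees only the frames at t = 0 and 1;
# the frame at t = 3 still attends to everything before it, so
# perception never waits for generation to finish.
```

Under this scheme generation and perception are decoupled in time rather than interleaved turn by turn, which is the property the abstract contrasts with the interleaved perception-generation paradigm.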