Enables infinite-length video understanding on a single consumer GPU (RTX 3090) through a training-free visual memory mechanism.
April 1, 2026
Original Paper
Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
arXiv · 2603.29252
The Takeaway
FlexMem sidesteps the input-length limits of current MLLMs by mimicking human recall behavior through dual-pathway KV cache compression. It lets practitioners process 1,000+ frames with state-of-the-art performance, without expensive H100 clusters or model retraining.
From the abstract
Long video understanding is a key challenge that plagues the advancement of *Multimodal Large Language Models* (MLLMs). In this paper, we study this problem from the perspective of the visual memory mechanism and propose a novel, training-free approach termed *Flexible Memory* (**FlexMem**). In principle, FlexMem aims to mimic the human behavior of video watching, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question.
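The watch-compress-recall loop the abstract describes can be sketched as a toy memory bank. This is a hypothetical illustration, not FlexMem's implementation: the class name, mean-pooling "compression", and cosine-similarity retrieval are all assumptions; the actual method compresses attention keys/values inside the MLLM's KV cache.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class VisualMemoryBank:
    """Toy stand-in for a visual memory mechanism: incoming frame
    features are buffered, compressed into fixed-size memory fragments,
    and the fragments most relevant to a query are recalled."""

    def __init__(self, fragment_size=4):
        self.fragment_size = fragment_size
        self.buffer = []   # frames not yet compressed
        self.memory = []   # list of (fragment_id, compressed_feature)

    def watch(self, frame_feature):
        """'Continually watch': ingest one frame's feature vector."""
        self.buffer.append(frame_feature)
        if len(self.buffer) == self.fragment_size:
            # "Compression" here is mean-pooling the buffered frames;
            # the real method compresses the KV cache instead.
            dim = len(self.buffer[0])
            pooled = [sum(f[d] for f in self.buffer) / len(self.buffer)
                      for d in range(dim)]
            self.memory.append((len(self.memory), pooled))
            self.buffer = []

    def recall(self, query, k=2):
        """Return ids of the top-k fragments most similar to the query."""
        scored = sorted(self.memory,
                        key=lambda m: cosine(query, m[1]),
                        reverse=True)
        return [frag_id for frag_id, _ in scored[:k]]

# Usage: six frames compress into three fragments; a query embedding
# recalls the fragment whose pooled feature matches it best.
bank = VisualMemoryBank(fragment_size=2)
for frame in [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]]:
    bank.watch(frame)
print(bank.recall([0, 1], k=1))  # → [1]
```

The key property this mimics is that answering a question never requires the full frame history in context, only the recalled fragments, which is what allows arbitrarily long videos on a single consumer GPU.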