AI & ML Efficiency Breakthrough

Enables infinite-length video understanding on a single consumer GPU (RTX 3090) through a training-free visual memory mechanism.

April 1, 2026

Original Paper

Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

Tao Chen, Kun Zhang, Qiong Wu, Xiao Chen, Chao Chang, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji

arXiv · 2603.29252

The Takeaway

FlexMem bypasses the input limits of current MLLMs by mimicking human recall behavior via a dual-pathway KV-cache compression. It lets practitioners process 1,000+ frames with state-of-the-art performance, without expensive H100 clusters or model retraining.

From the abstract

Long video understanding is a key challenge that plagues the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of the visual memory mechanism, and propose a novel, training-free approach, termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic human behavior when watching video, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question.
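The "watch, compress, recall" idea can be sketched in a few lines. The code below is a minimal toy illustration, not the paper's actual method: it splits streamed frame features into fixed-size fragments, summarizes each fragment by its mean vector (a crude stand-in for KV-cache compression), and recalls the fragments most similar to a query. Function names, the chunking scheme, and the similarity measure are all assumptions for illustration.

```python
import numpy as np

def build_memory(frame_features, chunk_size=8):
    """Split streamed frame features into fixed-size memory fragments,
    summarizing each fragment by its mean feature (a simple stand-in
    for the paper's KV-cache compression, which differs)."""
    fragments, summaries = [], []
    for i in range(0, len(frame_features), chunk_size):
        chunk = frame_features[i:i + chunk_size]
        fragments.append(chunk)
        summaries.append(chunk.mean(axis=0))
    return fragments, np.stack(summaries)

def recall(query_feature, fragments, summaries, top_k=2):
    """Recall the fragments whose summaries are most cosine-similar
    to the query -- 'recalling the most relevant memory fragments'."""
    sims = summaries @ query_feature
    sims = sims / (np.linalg.norm(summaries, axis=1)
                   * np.linalg.norm(query_feature) + 1e-8)
    best = np.argsort(-sims)[:top_k]
    return [fragments[i] for i in best]

# Toy usage: 32 "frames" of 4-dim features, queried with frame 5.
rng = np.random.default_rng(0)
frames = rng.normal(size=(32, 4))
fragments, summaries = build_memory(frames)
recalled = recall(frames[5], fragments, summaries)
print(len(fragments), len(recalled))  # 4 fragments built, 2 recalled
```

Because memory grows per-fragment rather than per-frame, only the recalled fragments need to enter the model's context at answer time, which is what makes arbitrarily long streams tractable on a single GPU.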