LensWalk introduces a 'reason-plan-observe' loop that lets agents dynamically control which parts of a video they sample and at what temporal density.
March 26, 2026
Original Paper
LensWalk: Agentic Video Understanding by Planning How You See in Videos
arXiv · 2603.24558
The Takeaway
Traditional video understanding relies on a fixed, one-shot pre-processing of frames. This framework instead lets the model actively 'seek' evidence (e.g., scanning the whole video sparsely, then zooming into specific seconds at higher density), improving accuracy on long-video benchmarks by over 5% without any fine-tuning.
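To make the loop concrete, here is a minimal Python sketch of what a reason-plan-observe controller could look like. The paper does not specify this interface; every name here (`VideoView`, `plan_next_view`, `observe`, `synthesize_answer`) is a hypothetical stand-in for the LLM reasoner and the frame-sampling perception step, not LensWalk's actual API.

```python
# Hypothetical sketch of a 'reason-plan-observe' loop in the spirit of
# LensWalk. All names below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class VideoView:
    start_s: float  # window start (seconds)
    end_s: float    # window end (seconds)
    fps: float      # sampling density inside the window

def plan_next_view(question: str, notes: list[str]) -> "VideoView | None":
    """Placeholder for the LLM reasoner: given the question and the
    evidence gathered so far, either request another (window, fps)
    observation or return None to stop and answer."""
    ...

def observe(view: VideoView) -> str:
    """Placeholder perception step: decode frames in [start_s, end_s]
    at view.fps and describe them (e.g., with a VLM captioner)."""
    ...

def synthesize_answer(question: str, notes: list[str]) -> str:
    """Placeholder final LLM call that answers from the gathered notes."""
    ...

def lenswalk_loop(question: str, duration_s: float, max_steps: int = 8) -> str:
    notes: list[str] = []
    # Start with a sparse pass over the whole video, then let the
    # planner zoom into promising seconds at higher fps.
    view: "VideoView | None" = VideoView(0.0, duration_s, fps=0.5)
    for _ in range(max_steps):
        notes.append(observe(view))             # observe: gather raw evidence
        view = plan_next_view(question, notes)  # reason + plan the next look
        if view is None:                        # planner is confident; stop
            break
    return synthesize_answer(question, notes)
```

The key design point the takeaway highlights is that sampling decisions happen inside the loop rather than in a fixed pre-processing stage, which is also why no fine-tuning is needed: the reasoner only has to emit the next (window, fps) request.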
From the abstract
The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner […]