A training-free visual token pruning framework for Large Vision-Language Models that preserves geometric structure through subspace reconstruction.
March 24, 2026
Original Paper
ResPrune: Text-Conditioned Subspace Reconstruction for Visual Token Pruning in Large Vision-Language Models
arXiv · 2603.21105
The Takeaway
Practitioners can immediately reduce KV-cache memory and inference latency in models like Qwen2.5-VL or LLaVA without any retraining: ResPrune uses a lightweight greedy expansion strategy to retain only the visual tokens that are most informative relative to the text prompt.
From the abstract
Large Vision-Language Models (LVLMs) rely on dense visual tokens to capture fine-grained visual information, but processing all of these tokens incurs substantial computational and memory overhead during inference. To address this issue, we propose ResPrune, a training-free visual token pruning framework that enables efficient LVLM inference by selecting a compact yet informative subset of visual tokens. ResPrune formulates visual token pruning as a subspace reconstruction problem and employs a greedy expansion strategy to retain the most informative tokens relative to the text prompt.
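To make the idea concrete, here is a minimal sketch of text-conditioned greedy subspace selection. This is an illustrative heuristic, not the paper's exact algorithm: it assumes token features `V` and a text embedding `t` are available, scores each token by how much remaining feature energy its direction explains (weighted by cosine similarity to the text), and deflates the chosen direction before the next pick.

```python
import numpy as np

def resprune_sketch(V, t, k):
    """Hypothetical text-conditioned greedy token selection.

    V : (n, d) visual token features
    t : (d,)   text prompt embedding (assumed available)
    k : number of tokens to keep
    """
    n, d = V.shape
    R = V.astype(float).copy()  # residual of each token vs. the selected span
    # cosine similarity of each token to the text, mapped to a [0, 1] weight
    rel = V @ t / (np.linalg.norm(V, axis=1) * np.linalg.norm(t) + 1e-12)
    rel = (rel + 1.0) / 2.0
    selected = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=1)   # unexplained energy per token
        scores = norms * rel                # text-conditioned score
        scores[selected] = -np.inf          # never re-pick a token
        i = int(np.argmax(scores))
        selected.append(i)
        u = R[i] / (np.linalg.norm(R[i]) + 1e-12)
        R -= np.outer(R @ u, u)             # deflate the chosen direction
    return sorted(selected)
```

Because selection only uses norms and one rank-1 update per pick, the cost is O(nkd), which is consistent with the "lightweight" framing; the remaining (n - k) tokens are simply dropped before the LLM forward pass.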