Achieves a 6x compute reduction in multimodal LLMs while improving accuracy by 2%.
March 27, 2026
Original Paper
ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs
arXiv · 2603.24680
The Takeaway
Unlike previous methods that prune tokens after projection, ReDiPrune operates on the raw encoder outputs using a relevance-diversity rule, sidestepping the common trade-off in which efficiency gains cost performance.
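A relevance-diversity rule of this kind can be sketched as maximal-marginal-relevance selection over pre-projection features: score each visual token by its similarity to a global (e.g. CLS) feature, then greedily keep tokens that are relevant but not redundant with those already kept. The sketch below is illustrative only; the scoring function, the trade-off weight `lam`, and the use of the CLS token as the relevance anchor are assumptions, not the paper's exact formulation.

```python
import numpy as np

def redi_prune(tokens, cls_feat, keep, lam=0.5):
    """Illustrative MMR-style relevance-diversity token selection.

    tokens:   (N, D) pre-projection visual features from the vision encoder
    cls_feat: (D,)   global feature used as the relevance anchor (assumed)
    keep:     number of tokens to retain
    lam:      relevance/diversity trade-off weight (assumed hyperparameter)
    """
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    c = cls_feat / np.linalg.norm(cls_feat)
    relevance = t @ c                         # cosine similarity to global feature
    selected = [int(np.argmax(relevance))]    # seed with the most relevant token
    for _ in range(keep - 1):
        sim_to_sel = t @ t[selected].T        # (N, |S|) similarity to chosen set
        redundancy = sim_to_sel.max(axis=1)   # closeness to nearest kept token
        score = lam * relevance - (1 - lam) * redundancy
        score[selected] = -np.inf             # never re-pick a kept token
        selected.append(int(np.argmax(score)))
    return np.array(sorted(selected))         # indices of retained tokens
```

Because selection happens before the vision-language projector, the kept tokens are chosen while the features are still high-dimensional and discriminative; only the surviving subset is then projected into the LLM's embedding space.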
From the abstract
Recent multimodal large language models are computationally expensive because Transformers must process a large number of visual tokens. We present ReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features remain rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and s