ASAP cuts LVLM inference FLOPs by ~80% with virtually no loss in performance, using a training-free KV-cache pruning recipe.
March 17, 2026
Original Paper
ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference
arXiv · 2603.14549
The Takeaway
Unlike previous methods, which ignore the 'attention shift' phenomenon (token attention scores becoming skewed), ASAP uses a dynamic bidirectional mask and soft merging. This enables high-resolution visual processing on resource-constrained devices without retraining the model.
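To make the idea concrete, here is a generic sketch of attention-score-based KV-cache pruning with soft merging. This is not ASAP's actual algorithm (the paper's dynamic bidirectional mask and shift correction are not reproduced here); the function name, shapes, and merging rule are illustrative assumptions.

```python
import numpy as np

def prune_kv_cache(keys, values, attn_scores, keep_ratio=0.2):
    """Illustrative attention-score-based KV-cache pruning with soft merging.

    keys, values: (n_tokens, d) cached entries for visual tokens.
    attn_scores:  (n_tokens,) aggregate attention each token received.
    Tokens outside the top-k are not discarded outright; each is merged
    into its most similar kept token, weighted by its attention score.
    NOTE: a generic sketch, not the method proposed in the ASAP paper.
    """
    n = keys.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = np.argsort(attn_scores)[-k:]           # indices of top-k tokens
    drop = np.setdiff1d(np.arange(n), keep)

    new_keys, new_values = keys[keep].copy(), values[keep].copy()
    weights = np.ones(k)                          # running merge weight per kept token

    for i in drop:
        # nearest kept token by cosine similarity of keys
        sims = new_keys @ keys[i] / (
            np.linalg.norm(new_keys, axis=1) * np.linalg.norm(keys[i]) + 1e-8
        )
        j = int(np.argmax(sims))
        w = attn_scores[i]
        # soft merge: attention-weighted running average instead of hard drop
        new_keys[j] = (weights[j] * new_keys[j] + w * keys[i]) / (weights[j] + w)
        new_values[j] = (weights[j] * new_values[j] + w * values[i]) / (weights[j] + w)
        weights[j] += w

    return new_keys, new_values
```

With `keep_ratio=0.2`, the cache shrinks to 20% of its visual tokens, which is the kind of reduction behind an ~80% FLOPs cut; soft merging preserves some information from pruned tokens rather than discarding it.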
From the abstract
While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Though recent token reduction strategies attempt to accelerate inference, such methods inadequately exploit attention values and fail to address token redundancy. More critically, they overlook the "attention shift" phenomenon inherent in LVLMs, which skews token attention scores. […]