ASAP cuts LVLM inference FLOPs by ~80% with virtually no loss in performance, using a training-free KV-cache pruning recipe.
March 17, 2026
Original Paper
ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference
arXiv · 2603.14549
The Takeaway
Unlike previous methods, which ignore the 'attention shift' phenomenon (token attention scores becoming skewed), ASAP uses a dynamic bidirectional mask and soft merging. This enables high-resolution visual processing on resource-constrained devices without retraining the model.
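To make the idea concrete, here is a generic sketch of attention-score-based KV-cache pruning with soft merging. This is not ASAP's actual algorithm (the paper's dynamic bidirectional mask and shift correction are not reproduced here); the function name, shapes, and merging rule are illustrative assumptions.

```python
import numpy as np

def prune_kv_cache(keys, values, attn_scores, keep_ratio=0.2):
    """Illustrative attention-score-based KV-cache pruning with soft merging.

    keys, values: (n_tokens, d) cached entries for visual tokens.
    attn_scores:  (n_tokens,) aggregate attention each token received.
    Tokens outside the top-k are not discarded outright; each is merged
    into its most similar kept token, weighted by its attention score.
    NOTE: a generic sketch, not the method proposed in the ASAP paper.
    """
    n = keys.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = np.argsort(attn_scores)[-k:]           # indices of top-k tokens
    drop = np.setdiff1d(np.arange(n), keep)

    new_keys, new_values = keys[keep].copy(), values[keep].copy()
    weights = np.ones(k)                          # running merge weight per kept token

    for i in drop:
        # nearest kept token by cosine similarity of keys
        sims = new_keys @ keys[i] / (
            np.linalg.norm(new_keys, axis=1) * np.linalg.norm(keys[i]) + 1e-8
        )
        j = int(np.argmax(sims))
        w = attn_scores[i]
        # soft merge: attention-weighted running average instead of hard drop
        new_keys[j] = (weights[j] * new_keys[j] + w * keys[i]) / (weights[j] + w)
        new_values[j] = (weights[j] * new_values[j] + w * values[i]) / (weights[j] + w)
        weights[j] += w

    return new_keys, new_values
```

With `keep_ratio=0.2`, the cache shrinks to 20% of its visual tokens, which is the kind of reduction behind an ~80% FLOPs cut; soft merging preserves some information from pruned tokens rather than discarding it.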
From the abstract
While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Though recent token reduction strategies attempt to accelerate inference, such methods inadequately exploit attention values and fail to address token redundancy. More critically, they overlook the "attention shift" phenomenon inherent in LVLMs, which skews token attention scores. […]