A vector-wise sparse attention mechanism that accelerates long-context video inference by 2.6x with zero loss in accuracy.
April 1, 2026
Original Paper
VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference
arXiv · 2603.29494
The Takeaway
VecAttention identifies a distinctive 'vertical-vector' sparsity pattern in video attention maps, which lets it skip the redundant computation that coarse-grained sparse patterns incur in standard models. This offers a practical path to scaling video transformers to very long contexts without the quadratic cost of dense self-attention.
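To make the idea concrete, here is a minimal NumPy sketch of a select-then-attend scheme in this spirit: a handful of probe queries estimate how much attention mass each key column (a 'vertical vector' in the attention map) receives, and exact attention is then computed only over the heaviest columns. The function name, the probe heuristic, and the `keep_ratio` parameter are illustrative assumptions, not the paper's actual algorithm or API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vertical_sparse_attention(q, k, v, keep_ratio=0.1, n_probe=32):
    """Sketch (not the paper's method): estimate per-key 'vertical' attention
    mass with a few probe queries, keep only the heaviest key columns, then
    attend to those columns alone."""
    Lq, d = q.shape
    Lk = k.shape[0]
    scale = 1.0 / np.sqrt(d)

    # Probe a strided subset of queries to estimate each key column's mass.
    probe = q[:: max(1, Lq // n_probe)]                              # (~n_probe, d)
    column_mass = softmax(probe @ k.T * scale, axis=-1).sum(axis=0)  # (Lk,)

    # Keep the key columns that dominate the attention map.
    n_keep = max(1, int(Lk * keep_ratio))
    keep = np.argsort(column_mass)[-n_keep:]

    # Exact attention restricted to the selected key/value vectors.
    attn = softmax(q @ k[keep].T * scale, axis=-1)                   # (Lq, n_keep)
    return attn @ v[keep]

# Toy usage: 4096-token context, 64-dim head, keep 10% of key columns.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4096, 64)).astype(np.float32) for _ in range(3))
out = vertical_sparse_attention(q, k, v, keep_ratio=0.1)
print(out.shape)  # (4096, 64)
```

The point of the sketch is the granularity: importance is decided per key vector rather than per block, so no unimportant columns ride along with a kept tile.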
From the abstract
Long-context video understanding and generation pose a significant computational challenge for Transformer-based video models due to the quadratic complexity of self-attention. While existing sparse attention methods employ coarse-grained patterns to improve efficiency, they typically incur redundant computation and suboptimal performance. To address this issue, in this paper, we propose VecAttention, a novel framework of vector-wise sparse attention that achieves a superior accuracy-efficiency trade-off…
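As a back-of-the-envelope illustration of the redundancy the abstract mentions (the uniform-scatter assumption here is ours, not the paper's): if the important keys are spread across the sequence, a block-granular mask must compute every tile that contains even one of them, while a vector-wise mask pays only for those columns themselves.

```python
import numpy as np

# Toy count: cover the same 5% of "important" key columns with a vector-wise
# mask vs. a 64-wide block mask, assuming the keys are scattered uniformly.
Lk, block = 4096, 64
rng = np.random.default_rng(1)
important = rng.choice(Lk, size=int(0.05 * Lk), replace=False)

vector_cols = important.size                 # vector-wise: exactly those columns
blocks_hit = np.unique(important // block)   # block mask keeps every touched tile
block_cols = blocks_hit.size * block         # ...and computes each tile in full

print(f"vector-wise: {vector_cols} columns, block-sparse: {block_cols} "
      f"({block_cols / vector_cols:.1f}x redundant)")
```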