Sparse Feature Attention (SFA) reduces the $O(n^2 d)$ cost of self-attention along the feature axis rather than the sequence axis: by representing queries and keys as $k$-sparse codes, the score computation scales with the sparsity $k$ instead of the full dimension $d$, enabling 2.5x speedups.
March 25, 2026
Original Paper
Scaling Attention via Feature Sparsity
arXiv · 2603.22300
The Takeaway
Unlike sequence-axis sparsity, which often hurts accuracy, feature-axis sparsity maintains high-dimensional expressivity. The inclusion of FlashSFA (an IO-aware kernel) makes this a drop-in efficiency gain for long-context Transformers.
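The headline cost reduction can be sketched with a quick FLOP count. The values below (sequence length, head dimension, sparsity level) are hypothetical, chosen only to illustrate why the theoretical ratio $d/k$ upper-bounds the measured 2.5x end-to-end speedup:

```python
# Hypothetical FLOP comparison: dense vs. k-sparse feature attention scores.
# All three values are illustrative assumptions, not numbers from the paper.
n = 8192   # sequence length
d = 128    # head dimension
k = 16     # nonzero features per query/key code under SFA

dense_cost = n * n * d    # O(n^2 d): every feature participates in each dot product
sparse_cost = n * n * k   # O(n^2 k): only k features per code participate

print(dense_cost // sparse_cost)  # theoretical 8x reduction in score FLOPs
```

The measured 2.5x wall-clock speedup is naturally smaller than the raw FLOP ratio, since memory movement and the non-score parts of attention (softmax, the value aggregation) are unchanged.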
From the abstract
Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^2 d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: feature sparsity. We propose Sparse Feature Attention (SFA), where queries and keys are represented as $k$-sparse codes that preserve high-dimensional expressivity.
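The core idea in the abstract can be sketched in a few lines of NumPy: keep only the $k$ largest-magnitude features of each query and key before computing attention scores. This is an illustrative sketch of the $k$-sparse-code idea, not the paper's actual sparsification scheme or its FlashSFA kernel, and the function names are our own:

```python
import numpy as np

def topk_sparsify(x, k):
    """Keep the k largest-magnitude features per row; zero the rest (sketch)."""
    idx = np.argpartition(np.abs(x), -k, axis=-1)[..., -k:]
    out = np.zeros_like(x)
    np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=-1), axis=-1)
    return out

def sparse_feature_attention(Q, K, V, k):
    """Softmax attention over k-sparse query/key codes (illustrative only)."""
    Qs, Ks = topk_sparsify(Q, k), topk_sparsify(K, k)
    scores = Qs @ Ks.T / np.sqrt(Q.shape[-1])
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d, k = 6, 16, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = sparse_feature_attention(Q, K, V, k)
print(out.shape)  # (6, 16)
```

A dedicated kernel can exploit the zeros (only $k$ of $d$ feature products are nonzero per score), which is where the efficiency gain would come from; this dense NumPy version only demonstrates the computation's semantics.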