Sparse Feature Attention (SFA) reduces the $O(n^2 d)$ cost of self-attention along the feature axis rather than the sequence axis: by representing queries and keys as $k$-sparse codes, the score computation scales with the sparsity $k$ instead of the full dimension $d$, enabling 2.5x speedups.
March 25, 2026
Original Paper
Scaling Attention via Feature Sparsity
arXiv · 2603.22300
The Takeaway
Unlike sequence-axis sparsity, which often hurts accuracy, feature-axis sparsity maintains high-dimensional expressivity. The inclusion of FlashSFA (an IO-aware kernel) makes this a drop-in efficiency gain for long-context Transformers.
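The headline cost reduction can be sketched with a quick FLOP count. The values below (sequence length, head dimension, sparsity level) are hypothetical, chosen only to illustrate why the theoretical ratio $d/k$ upper-bounds the measured 2.5x end-to-end speedup:

```python
# Hypothetical FLOP comparison: dense vs. k-sparse feature attention scores.
# All three values are illustrative assumptions, not numbers from the paper.
n = 8192   # sequence length
d = 128    # head dimension
k = 16     # nonzero features per query/key code under SFA

dense_cost = n * n * d    # O(n^2 d): every feature participates in each dot product
sparse_cost = n * n * k   # O(n^2 k): only k features per code participate

print(dense_cost // sparse_cost)  # theoretical 8x reduction in score FLOPs
```

The measured 2.5x wall-clock speedup is naturally smaller than the raw FLOP ratio, since memory movement and the non-score parts of attention (softmax, the value aggregation) are unchanged.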
From the abstract
Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^2 d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: feature sparsity. We propose Sparse Feature Attention (SFA), where queries and keys are represented as $k$-sparse codes that preserve high-dimensional expressivity.
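The core idea in the abstract can be sketched in a few lines of NumPy: keep only the $k$ largest-magnitude features of each query and key before computing attention scores. This is an illustrative sketch of the $k$-sparse-code idea, not the paper's actual sparsification scheme or its FlashSFA kernel, and the function names are our own:

```python
import numpy as np

def topk_sparsify(x, k):
    """Keep the k largest-magnitude features per row; zero the rest (sketch)."""
    idx = np.argpartition(np.abs(x), -k, axis=-1)[..., -k:]
    out = np.zeros_like(x)
    np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=-1), axis=-1)
    return out

def sparse_feature_attention(Q, K, V, k):
    """Softmax attention over k-sparse query/key codes (illustrative only)."""
    Qs, Ks = topk_sparsify(Q, k), topk_sparsify(K, k)
    scores = Qs @ Ks.T / np.sqrt(Q.shape[-1])
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d, k = 6, 16, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = sparse_feature_attention(Q, K, V, k)
print(out.shape)  # (6, 16)
```

A dedicated kernel can exploit the zeros (only $k$ of $d$ feature products are nonzero per score), which is where the efficiency gain would come from; this dense NumPy version only demonstrates the computation's semantics.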