Accelerates LLM inference by up to 1.8x using a training-free sparse pattern predictor based on SVD truncation of FFN gate matrices.
March 17, 2026
Original Paper
SVD Contextual Sparsity Predictors for Fast LLM Inference
arXiv · 2603.14110
The Takeaway
Existing contextual sparsity methods typically require training dedicated sparse-pattern predictors, which is expensive; this approach instead applies a simple SVD to the FFN gate weights to predict activation patterns. The result is a practical, drop-in speedup for ReGLU-based LLMs (such as Llama and Mistral variants) on edge devices without sacrificing accuracy.
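The core idea can be sketched in a few lines: truncate the SVD of the gate matrix to a low rank, use the cheap low-rank product as an estimate of the gate pre-activations, and evaluate the FFN only on neurons predicted active. The sketch below is our illustration under stated assumptions (toy random weights, a ReLU gate as in ReGLU, and a rank we chose arbitrarily); all names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, rank = 64, 256, 8

# Toy ReGLU FFN weights; in a real LLM these come from the checkpoint.
W_gate = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_model)
W_up   = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_model)
W_down = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_ff)

# Training-free predictor: rank-r SVD truncation of the gate matrix.
U, S, Vt = np.linalg.svd(W_gate, full_matrices=False)
U_r, S_r, Vt_r = U[:, :rank], S[:rank], Vt[:rank]

def predict_active(x):
    """Low-rank estimate of the gate pre-activations; ReLU neurons with a
    positive estimate are predicted active."""
    approx = U_r @ (S_r * (Vt_r @ x))  # costs O(r * (d_model + d_ff))
    return approx > 0.0

def sparse_reglu_ffn(x, active):
    """Evaluate the ReGLU FFN only on the predicted-active rows."""
    gate = np.maximum(W_gate[active] @ x, 0.0)  # ReLU gate
    up = W_up[active] @ x
    return W_down[:, active] @ (gate * up)

x = rng.standard_normal(d_model)
active = predict_active(x)
y_sparse = sparse_reglu_ffn(x, active)
y_dense = W_down @ (np.maximum(W_gate @ x, 0.0) * (W_up @ x))
print("fraction of neurons kept:", active.mean())
print("relative error vs dense:",
      np.linalg.norm(y_sparse - y_dense) / np.linalg.norm(y_dense))
```

Because ReLU-gated neurons with a non-positive gate contribute nothing, any error comes solely from mispredicted neurons; with a perfect activity oracle the sparse and dense outputs coincide exactly.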
From the abstract
Contextual sparsity is one of the approaches used to reduce computational complexity in the inference process of large language models (LLMs). Existing techniques for efficient LLM inference acceleration based on contextual sparsity with minimal accuracy degradation require training sparse pattern predictors. This paper presents a framework for accelerating inference of ReGLU-based feed-forward networks (FFNs) within LLMs. The proposed framework provides a fast, training-free method for building