Sparse Autoencoders (SAEs) fail at compositional generalization due to flawed dictionary learning, not the inference method.
March 31, 2026
Original Paper
Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation
arXiv · 2603.28744
The Takeaway
The paper identifies a fundamental 'amortization gap' in SAEs and argues that current dictionary learning methods, not the amortized encoders themselves, are the primary bottleneck in neural interpretability. This shifts the research focus from improving SAE encoders to the harder problem of scalable dictionary learning for sparse inference.
From the abstract
The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent
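The contrast the abstract draws, between a one-shot amortized encoder and per-sample iterative sparse inference, can be sketched on a toy superposition problem. The dimensions, the ISTA solver, and the untrained ReLU(Dᵀx) encoder below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy superposition: m latent concepts projected into n < m activation dims.
m, n, k = 64, 16, 3                      # dictionary size, activation dim, sparsity
D = rng.normal(size=(n, m))
D /= np.linalg.norm(D, axis=0)           # unit-norm dictionary atoms

# Sparse ground-truth code z (k active concepts); observed activation x = D z.
z = np.zeros(m)
z[rng.choice(m, k, replace=False)] = rng.uniform(1.0, 2.0, k)
x = D @ z

def ista(x, D, lam=0.05, steps=500):
    """Per-sample iterative sparse inference:
    minimize 0.5 * ||x - D z||^2 + lam * ||z||_1 via ISTA."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the smooth part
    z_hat = np.zeros(D.shape[1])
    for _ in range(steps):
        z_hat = z_hat - (D.T @ (D @ z_hat - x)) / L       # gradient step
        z_hat = np.sign(z_hat) * np.maximum(np.abs(z_hat) - lam / L, 0.0)  # soft-threshold
    return z_hat

# One-shot amortized encoder in the style of a standard SAE:
# ReLU of a linear map (here, untrained tied weights D^T, purely for illustration).
z_amortized = np.maximum(D.T @ x, 0.0)
z_iterative = ista(x, D)

err_amortized = np.linalg.norm(D @ z_amortized - x)
err_iterative = np.linalg.norm(D @ z_iterative - x)
```

On this synthetic instance the iterative solver reconstructs x far more accurately and sparsely than the one-shot encoder, which is the kind of gap the compressed-sensing guarantees mentioned in the abstract formalize.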