Standard alignment metrics like CKA (Centered Kernel Alignment) and RSA (Representational Similarity Analysis) systematically fail when comparing networks in superposition, often leading to false conclusions about model similarity.
April 2, 2026
Original Paper
Measuring the Representational Alignment of Neural Systems in Superposition
arXiv · 2604.00208
The Takeaway
As the field shifts toward sparse autoencoders and mechanistic interpretability, this paper shows that our primary tools for comparing internal representations are fundamentally flawed for compressed features. The practical implication: researchers should align the underlying sparse features rather than the raw neural activations.
From the abstract
Comparing the internal representations of neural networks is a central goal in both neuroscience and machine learning. Standard alignment metrics operate on raw neural activations, implicitly assuming that similar representations produce similar activity patterns. However, neural systems frequently operate in superposition, encoding more features than they have neurons via linear compression. We derive closed-form expressions showing that superposition systematically deflates Representational Similarity…
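To make the failure mode concrete, here is a minimal sketch (not from the paper; all names and parameter choices are illustrative assumptions): two toy "networks" encode the exact same sparse feature code but compress it into a small neuron basis through different random projections, i.e. superposition. Linear CKA computed on the raw activations comes out well below the feature-level value of 1.0, even though the underlying representations are identical.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices (samples x units)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    num = np.linalg.norm(y.T @ x, "fro") ** 2
    den = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return num / den

rng = np.random.default_rng(0)
n, k, d = 500, 100, 10  # samples, sparse features, neurons (d << k)

# Both "networks" share the exact same sparse feature code...
z = rng.random((n, k)) * (rng.random((n, k)) < 0.05)

# ...but compress it into d neurons via different random maps (superposition).
x1 = z @ rng.normal(size=(k, d))
x2 = z @ rng.normal(size=(k, d))

print(f"feature-level CKA:  {linear_cka(z, z):.3f}")   # identical codes -> 1.000
print(f"raw-activation CKA: {linear_cka(x1, x2):.3f}")  # deflated below 1.0
```

Comparing the feature matrices directly recovers perfect similarity, while the activation-level score is deflated purely by the compression, which is the artifact the paper formalizes.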