AI & ML Paradigm Shift

Instead of using top-activating examples, this method steers Sparse Autoencoder (SAE) features in Vision-Language Models to let the model describe its own internal visual features.

March 25, 2026

Original Paper

Language Models Can Explain Visual Features via Steering

Javier Ferrando, Enrique Lopez-Cuena, Pablo Agustin Martin-Torres, Daniel Hinjos, Anna Arias-Duart, Dario Garcia-Gasulla

arXiv · 2603.22593

The Takeaway

This method shifts automated interpretability from correlation-based observation (which images activate this feature?) to causal intervention (what does the model itself think this feature represents?). It complements traditional interpretability tools, scales better with model size, and yields more intuitive explanations of vision encoders.
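The core intervention can be sketched in a few lines: add a scaled copy of one SAE feature's decoder direction to the vision encoder's activations for a blank image, then let the language model describe what it "sees". The sketch below is a toy illustration with mocked activations and random decoder weights, not the paper's implementation; all sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64  # toy dimensions (assumed, not from the paper)

# Toy SAE decoder: each row is one feature's direction in the
# vision encoder's activation space, normalized to unit length.
W_dec = rng.standard_normal((n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

def steer(acts, feature_idx, alpha):
    """Add alpha times the chosen feature's decoder direction
    to every patch token's activation (the causal intervention)."""
    return acts + alpha * W_dec[feature_idx]

# "Empty image": activations from a blank input, mocked here as zeros.
empty_acts = np.zeros((4, d_model))  # 4 patch tokens
steered = steer(empty_acts, feature_idx=7, alpha=5.0)

# Each steered token now points exactly along feature 7's direction;
# in the real pipeline these activations feed the language model,
# which then verbalizes the feature.
cos = steered @ W_dec[7] / (
    np.linalg.norm(steered, axis=1) * np.linalg.norm(W_dec[7])
)
print(cos)  # → [1. 1. 1. 1.]
```

In practice the activations come from a hook inside the vision encoder of a VLM, and the steering coefficient alpha is swept to find a strength that changes the model's description without breaking coherence.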

From the abstract

Sparse Autoencoders uncover thousands of features in vision models, yet explaining these features without requiring human intervention remains an open challenge. While previous work has proposed generating correlation-based explanations based on top-activating input examples, we present a fundamentally different alternative based on causal interventions. We leverage the structure of Vision-Language Models and steer individual SAE features in the vision encoder after providing an empty image. The