Instead of relying on top-activating examples, this method steers Sparse Autoencoder (SAE) features in a Vision-Language Model to let the model describe its own internal visual features.
March 25, 2026
Original Paper
Language Models Can Explain Visual Features via Steering
arXiv · 2603.22593
The Takeaway
This approach shifts automated interpretability from correlation-based observation (which images activate this feature?) to causal intervention (what does the model think this feature looks like?). It complements traditional interpretability tools, scales better with model size, and yields more intuitive explanations of vision encoders.
From the abstract
Sparse Autoencoders uncover thousands of features in vision models, yet explaining these features without requiring human intervention remains an open challenge. While previous work has proposed generating correlation-based explanations from top-activating input examples, we present a fundamentally different alternative based on causal interventions. We leverage the structure of Vision-Language Models and steer individual SAE features in the vision encoder after providing an empty image. […]
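To make the intervention concrete, here is a minimal sketch of what a steering step like this could look like in PyTorch. The handles `vlm`, `sae`, `processor`, and `hook_layer`, the Hugging Face-style `processor`/`generate` calls, and the steering `strength` are illustrative assumptions rather than the paper's exact implementation; the core idea is simply to add one SAE feature's decoder direction to the vision encoder's activations for a blank image and let the language model describe what it "sees."

```python
import torch
from PIL import Image

def describe_sae_feature(vlm, sae, processor, hook_layer, feature_idx, strength=8.0):
    """Steer one SAE feature on a blank image and return the VLM's description of it.

    All object names and shapes here are hypothetical stand-ins for a loaded
    vision-language model, an SAE trained on one of its vision-encoder layers,
    and that model's input processor.
    """
    # Decoder column for the chosen feature: the direction this feature
    # writes into the vision encoder's activations (shape: d_model).
    direction = sae.decoder.weight[:, feature_idx].detach()

    def steer(module, inputs, output):
        # Assumes this layer returns a plain activation tensor of shape
        # (batch, n_patches, d_model); add the feature direction to every patch.
        return output + strength * direction

    handle = hook_layer.register_forward_hook(steer)
    try:
        blank = Image.new("RGB", (224, 224))  # the "empty" image carrying no real content
        inputs = processor(images=blank, text="Describe the image.", return_tensors="pt")
        out_ids = vlm.generate(**inputs, max_new_tokens=64)
    finally:
        handle.remove()  # always restore the unsteered model
    return processor.decode(out_ids[0], skip_special_tokens=True)
```

A natural usage pattern under these assumptions is to loop this over feature indices (and perhaps a few steering strengths) and collect the generated descriptions as candidate explanations for each feature.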