This paper demonstrates precise behavioral steering of agentic traits in a 35B-parameter MoE model, using probe vectors decoded through sparse autoencoders (SAEs).
arXiv · March 18, 2026 · 2603.16335
The Takeaway
The work validates that mechanistic interpretability techniques like SAEs can provide fine-grained inference-time control of complex agentic behaviors in production-scale models, sidestepping costly RLHF or fine-tuning when adjusting model personality or autonomy.
From the abstract
We train nine sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35-billion-parameter Mixture-of-Experts model with a hybrid GatedDeltaNet/attention architecture, and use them to identify and steer five agentic behavioral traits. Our method trains linear probes on SAE latent activations, then projects the probe weights back through the SAE decoder to obtain continuous steering vectors in the model's native activation space. This bypasses the SAE's top-k discretization, enabling […]
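To make the pipeline concrete, here is a minimal sketch of the probe-to-steering-vector idea as described in the abstract: train a linear probe on SAE latent activations, project its weights back through the SAE decoder, and add the resulting direction to the residual stream at inference time. All names and shapes (`d_model`, `d_sae`, `encode`, `W_dec`, `alpha`) are illustrative assumptions, not the paper's actual code or interface.

```python
import torch
import torch.nn as nn

# Hypothetical widths: residual stream and SAE latent dimensions.
d_model, d_sae = 4096, 65536

class SAE(nn.Module):
    """Toy stand-in for a trained sparse autoencoder (assumed interface)."""
    def __init__(self):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))

    def encode(self, resid):  # (batch, d_model) -> (batch, d_sae)
        return torch.relu(resid @ self.W_enc + self.b_enc)

def train_probe(latents, labels, steps=200, lr=1e-2):
    """Logistic-regression probe on SAE latents for one behavioral trait."""
    probe = nn.Linear(d_sae, 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(probe(latents).squeeze(-1), labels.float())
        loss.backward()
        opt.step()
    return probe

def steering_vector(probe, sae):
    # Project probe weights back through the decoder: a continuous direction
    # in the model's native activation space, bypassing top-k discretization.
    v = probe.weight.squeeze(0) @ sae.W_dec  # (d_sae,) @ (d_sae, d_model) -> (d_model,)
    return v / v.norm()  # unit norm so alpha alone sets steering strength

def steer(resid, v, alpha=4.0):
    # Inference-time hook: shift the residual stream along the trait direction.
    return resid + alpha * v
```

The key design point the abstract highlights is in `steering_vector`: because the probe weights are decoded as a dense linear combination of decoder rows, the resulting vector varies continuously with steering strength rather than toggling individual top-k latents on or off.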