AI & ML · Breaks an Assumption

Researchers identified just three specific attention heads that govern persona and style, enabling precise steering without degrading model coherence.

arXiv · March 17, 2026 · 2603.13249

Yoshihiro Izawa, Gouki Minegishi, Koshi Eguchi, Sosuke Hosokawa, Kenjiro Taura

The Takeaway

The paper challenges the assumption that abstract traits like persona are diffusely distributed throughout the residual stream. By intervening only on the few attention heads that carry the trait, practitioners can implement activation steering with significantly higher stability and lower off-target noise than current layer-wise or stream-wise intervention methods.
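As an illustration of what head-level steering could look like in practice, here is a minimal PyTorch sketch that edits only selected attention heads before their outputs are projected back into the residual stream. The (layer, head) indices, the steering strength, the vector, and the module paths are all assumptions for illustration; they are not taken from the paper.

```python
import torch

# Illustrative targets: the paper finds a sparse subset of heads that
# govern persona/style. These (layer, head) indices, the strength ALPHA,
# and the vector below are placeholders, not the paper's values.
STEER_HEADS = {(12, 3), (17, 9), (24, 1)}
ALPHA = 4.0
steer_vec = torch.randn(128)  # assumed head_dim = 128; a trait direction

def make_pre_hook(layer_idx, steer_vec, n_heads, head_dim):
    """Pre-hook for the attention output projection (o_proj in
    Llama-style models). Its input is the concatenation of per-head
    outputs, so individual heads can be edited before they are mixed
    back into the residual stream."""
    def pre_hook(module, args):
        hidden = args[0]  # (batch, seq, n_heads * head_dim)
        b, s, _ = hidden.shape
        heads = hidden.view(b, s, n_heads, head_dim).clone()
        for layer, head in STEER_HEADS:
            if layer == layer_idx:
                heads[:, :, head, :] += ALPHA * steer_vec
        return (heads.view(b, s, n_heads * head_dim),) + args[1:]
    return pre_hook

# Hypothetical wiring for a Llama-style checkpoint; attribute paths
# vary by architecture, so check your model before registering hooks:
# for i, block in enumerate(model.model.layers):
#     block.self_attn.o_proj.register_forward_pre_hook(
#         make_pre_hook(i, steer_vec, n_heads=32, head_dim=128))
```

Because only a few head subspaces are touched, the rest of the residual stream passes through unmodified, which is consistent with the stability gains described above.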

From the abstract

Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without fine-tuning. While these methods effectively control target traits (e.g., persona), coherency degradation remains a major obstacle to safety and practical deployment. We hypothesize that this degradation stems from intervening on the residual stream, which indiscriminately affects aggregated features and inadvertently amplifies off-target noise. In this work, we identify a sparse subset of …
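For contrast, the residual-stream intervention the abstract critiques typically adds a trait vector to every token's full hidden state at one or more layers. A minimal sketch follows, again using PyTorch hooks, with the layer index, strength, and vector as illustrative assumptions rather than the paper's setup.

```python
import torch

steer_vec = torch.randn(4096)  # assumed hidden size; a trait direction

def make_residual_hook(steer_vec, alpha=4.0):
    """Forward hook on a whole decoder block: adds the steering vector
    to the full residual stream, the indiscriminate edit the authors
    argue amplifies off-target noise."""
    def hook(module, inputs, output):
        # Many decoder blocks return a tuple whose first element is the
        # hidden states; adjust for your architecture.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * steer_vec
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Hypothetical wiring: steer a single mid-depth block.
# handle = model.model.layers[16].register_forward_hook(
#     make_residual_hook(steer_vec))
# ... run generation ...
# handle.remove()  # detach the hook when done
```

Because this edit shifts every feature aggregated in the hidden state, not just the targeted trait, it is the kind of intervention the authors hypothesize drives coherency degradation.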