Enables reliable, training-free emotion steering in speech-generative audio models via direct manipulation of specific emotion-sensitive neurons.
arXiv · March 19, 2026 · 2603.17231
The Takeaway
By identifying and intervening on emotion-sensitive neurons (ESNs) at inference time, the method achieves controllable emotional speech without the linguistic degradation (hallucinations, refusals) common in prompting-based approaches. This establishes a mechanistic path toward fine-grained control of multimodal LLMs.
From the abstract
Large audio-language models (LALMs) can produce expressive speech, yet reliable emotion control remains elusive: conversions often miss the target affect and may degrade linguistic fidelity through refusals, hallucinations, or paraphrase. We present, to our knowledge, the first neuron-level study of emotion control in speech-generative LALMs and demonstrate that compact emotion-sensitive neurons (ESNs) are causally actionable, enabling training-free emotion steering at inference time. […]
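The paper's exact selection and intervention procedure isn't reproduced here, but the core idea — find neurons whose activations separate emotion conditions, then shift them during generation — can be sketched with PyTorch forward hooks. Everything in this sketch (`find_esns`, `make_steering_hook`, the layer path, the coefficient `alpha`) is an illustrative assumption, not the authors' implementation.

```python
import torch

# Hypothetical sketch of neuron-level emotion steering, assuming a
# transformer LM whose MLP activations can be hooked. Names and the
# layer path are illustrative only, not from the paper.

def find_esns(acts_target, acts_neutral, top_k=64):
    """Rank neurons by how strongly their mean activation separates a
    target-emotion condition from a neutral one; keep the top_k."""
    # acts_*: [num_samples, hidden_dim] mean-pooled activations
    diff = acts_target.mean(0) - acts_neutral.mean(0)
    idx = diff.abs().topk(top_k).indices
    return idx, diff[idx]  # neuron indices and their signed shifts

def make_steering_hook(esn_idx, esn_shift, alpha=1.0):
    """Forward hook that nudges only the selected neurons toward the
    target emotion's activation statistics at inference time."""
    def hook(module, inputs, output):
        output[..., esn_idx] = output[..., esn_idx] + alpha * esn_shift
        return output
    return hook

# Usage sketch (the layer path is model-specific and assumed here):
# layer = model.model.layers[12].mlp.act_fn
# idx, shift = find_esns(acts_happy, acts_neutral, top_k=64)
# handle = layer.register_forward_hook(make_steering_hook(idx, shift, alpha=2.0))
# ... run speech generation ...
# handle.remove()  # restore the unmodified model
```

Because the hook only adds an offset to a compact set of neurons and is removed after generation, the intervention is training-free and leaves the model's weights untouched, which is consistent with the steering setup the abstract describes.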