AI & ML · New Capability

Enables reliable, training-free emotion steering in speech-generative audio models via direct manipulation of specific emotion-sensitive neurons.

arXiv · March 19, 2026 · 2603.17231

Xiutian Zhao, Ismail Rasim Ulgen, Philipp Koehn, Björn Schuller, Berrak Sisman

The Takeaway

By identifying and intervening on emotion-sensitive neurons (ESNs) at inference time, the method achieves controllable emotional speech without the linguistic degradation (refusals, hallucinations, paraphrasing) common in prompting-based approaches. This establishes a mechanistic path toward fine-grained control of multimodal LLMs.
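As a rough illustration, such an inference-time intervention can be pictured as a forward hook that rescales a handful of hidden units in one layer. The sketch below is a minimal PyTorch example under that assumption; the layer choice, neuron indices, and scaling factor are illustrative placeholders, not the paper's reported configuration.

```python
import torch
import torch.nn as nn


def steer_neurons(module: nn.Module, neuron_idx: list[int], scale: float):
    """Register a forward hook that rescales the chosen hidden units."""
    idx = torch.tensor(neuron_idx)

    def hook(_module, _inputs, output):
        steered = output.clone()
        steered[..., idx] = steered[..., idx] * scale  # amplify the selected neurons
        return steered                                 # replaces the module's output

    return module.register_forward_hook(hook)


if __name__ == "__main__":
    # Stand-in for one MLP block of a speech-generative model (hypothetical sizes).
    mlp = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
    esn_indices = [7, 113, 1500]            # hypothetical emotion-sensitive units

    handle = steer_neurons(mlp[1], esn_indices, scale=3.0)  # hook after the GELU
    with torch.no_grad():
        hidden = torch.randn(1, 16, 512)    # (batch, sequence, features)
        steered_out = mlp(hidden)           # forward pass with steered neurons
    handle.remove()                         # restore the original behaviour
```

Because the hook is attached and removed at inference time, no weights are updated, which is what makes this kind of steering training-free.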

From the abstract

Large audio-language models (LALMs) can produce expressive speech, yet reliable emotion control remains elusive: conversions often miss the target affect and may degrade linguistic fidelity through refusals, hallucinations, or paraphrase. We present, to our knowledge, the first neuron-level study of emotion control in speech-generative LALMs and demonstrate that compact emotion-sensitive neurons (ESNs) are causally actionable, enabling training-free emotion steering at inference time.
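The excerpt stops short of how ESNs are located. One common neuron-attribution heuristic, offered here only as an illustrative assumption rather than the paper's procedure, is to contrast a layer's activations on emotional versus neutral inputs and keep the units with the largest mean gap.

```python
import torch


def top_k_contrastive_neurons(acts_emotion: torch.Tensor,
                              acts_neutral: torch.Tensor,
                              k: int = 32) -> torch.Tensor:
    """Rank hidden units by the absolute mean-activation gap between two input sets.

    acts_emotion, acts_neutral: (num_samples, hidden_dim) activations collected
    from the same layer under emotional vs. neutral prompts.
    """
    gap = acts_emotion.mean(dim=0) - acts_neutral.mean(dim=0)
    return gap.abs().topk(k).indices  # candidate emotion-sensitive neurons


if __name__ == "__main__":
    hidden_dim = 2048                                  # hypothetical layer width
    acts_happy = torch.randn(100, hidden_dim) + 0.5    # toy activation statistics
    acts_neutral = torch.randn(100, hidden_dim)
    candidates = top_k_contrastive_neurons(acts_happy, acts_neutral, k=16)
    print(candidates.tolist())
```

The indices returned by such a ranking could then feed the inference-time steering hook sketched above.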