There are internal model states you can reach by poking the 'brain' directly that no text prompt can ever trigger.
April 14, 2026
Original Paper
Steered LLM Activations are Non-Surjective
arXiv · 2604.09839
The Takeaway
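The research proves a formal gap between activation steering and prompting: the map from text prompts to activation states is not surjective onto steered states, so some steered states cannot be produced by any prompt. This means direct activation manipulation can push a model into internal states, and the behaviors that follow from them, that text input alone never reaches.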
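To make "non-surjective" concrete, here is a minimal toy sketch. It is not the paper's model or proof. The three-token vocabulary, the `EMBED` table, the `activation` layer, and the steering vector are all hypothetical, chosen so that every prompt-reachable state has non-negative coordinates by construction. Steering then lands outside that set, so no prompt maps onto the steered state.

```python
import itertools
import math

# Hypothetical toy setup: 3-token vocabulary, 2-dimensional activations.
# All embedding entries are non-negative on purpose (an assumption of this toy).
EMBED = {"a": (1.0, 0.0), "b": (0.0, 1.0), "c": (1.0, 1.0)}


def activation(prompt):
    """Toy 'layer': tanh of the mean token embedding of the prompt."""
    n = len(prompt)
    mean = [sum(EMBED[tok][i] for tok in prompt) / n for i in range(2)]
    return tuple(math.tanh(x) for x in mean)


def steer(h, v, alpha):
    """Activation steering in its simplest form: h + alpha * v."""
    return tuple(hi + alpha * vi for hi, vi in zip(h, v))


def reachable_states(max_len):
    """Activations of every prompt with 1..max_len tokens."""
    states = set()
    for n in range(1, max_len + 1):
        for prompt in itertools.product(EMBED, repeat=n):
            states.add(activation(prompt))
    return states


prompt_states = reachable_states(max_len=4)

# Steer the activation of the prompt ("a", "b") along a hypothetical direction.
steered = steer(activation(("a", "b")), v=(-1.0, -1.0), alpha=2.0)

# Distance from the steered state to the nearest prompt-reachable state.
gap = min(math.dist(steered, s) for s in prompt_states)

print("steered state:", steered)
print("distance to nearest prompt state:", gap)
```

In this toy the gap stays positive for prompts of any length, because the mean of non-negative embeddings passed through tanh can never be negative, while the steered state has negative coordinates. Real transformers have no such simple bound, which is why the question needs a formal treatment.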
From the abstract
Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: f