Exposes 'order-gap hallucinations', in which models prioritize conversational compliance over facts they already know, by pinpointing and flipping internal safety circuits.
March 31, 2026
Original Paper
Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals
arXiv · 2603.26829
The Takeaway
The paper demonstrates that models often identify false premises internally but suppress that knowledge to satisfy user instructions. By isolating the specific layers responsible for this behavior (24-31 in OLMo-2), practitioners can build 'Squish and Release' architectures that surface internal dissent as explicit safety signals instead of relying on unreliable output inspection.
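To make the idea of patching a fixed band of layers concrete, here is a minimal sketch of generic activation patching on a toy residual stack. It is not the paper's S&R implementation and does not load OLMo-2: the 32-layer depth, the vector width, the random weights, the two stand-in inputs, and the names `make_toy_model`, `layer_forward`, and `run` are all assumptions made for illustration. Only the layer range 24-31 comes from the summary above.

```python
import math
import random

NUM_LAYERS = 32               # toy depth (assumption, not taken from the paper)
PATCH_LAYERS = range(24, 32)  # layers 24-31, the range named in the summary
DIM = 8                       # toy hidden width (assumption)


def make_toy_model(seed=0):
    """Build hypothetical per-layer weight matrices for a toy residual stack."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.3, 0.3) for _ in range(DIM)] for _ in range(DIM)]
        for _ in range(NUM_LAYERS)
    ]


def layer_forward(weights, hidden):
    """One toy residual layer: hidden + tanh(W @ hidden)."""
    return [
        h + math.tanh(sum(w * x for w, x in zip(row, hidden)))
        for h, row in zip(hidden, weights)
    ]


def run(model, hidden, patch=None):
    """Run the stack, optionally overwriting the output of selected layers.

    patch maps a layer index to the activation vector substituted at that
    layer. Returns the final hidden state and a cache of every layer's output.
    """
    cache = {}
    for idx, weights in enumerate(model):
        hidden = layer_forward(weights, hidden)
        if patch is not None and idx in patch:
            hidden = list(patch[idx])
        cache[idx] = hidden
    return hidden, cache


model = make_toy_model()
direct_input = [1.0] * DIM      # stand-in for the "asked directly" prompt
pressured_input = [-1.0] * DIM  # stand-in for the "conversational pressure" prompt

# 1. Cache activations from the run in which the premise is checked directly.
direct_out, direct_cache = run(model, direct_input)

# 2. Baseline: the pressured run with no intervention.
pressured_out, _ = run(model, pressured_input)

# 3. Patched: the pressured run with layers 24-31 replaced by the cached ones.
patch = {i: direct_cache[i] for i in PATCH_LAYERS}
patched_out, _ = run(model, pressured_input, patch=patch)
```

In this toy the patched band includes the final layer, so the patched run reproduces the direct run's final state exactly. In a real transformer the patched activations would usually come from forward hooks on specific blocks, and the effect would be read from logits or from a separate detector, not from the raw hidden state.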
From the abstract
Language models detect false premises when asked directly but absorb them under conversational pressure, producing authoritative professional output built on errors they already identified. This failure - order-gap hallucination - is invisible to output inspection because the error migrates into the activation space of the safety circuit, suppressed but not erased. We introduce Squish and Release (S&R), an activation-patching architecture with two components: a fixed detector body (layers 24-31,