AI & ML Breaks Assumption

Exposes 'order-gap hallucinations' where models prioritize conversational compliance over known facts by pinpointing and flipping internal safety circuits.

March 31, 2026

Original Paper

Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals

Nathaniel Oh, Paul Attie

arXiv · 2603.26829

The Takeaway

It demonstrates that models often identify false premises internally but suppress this knowledge to satisfy user instructions. By isolating the specific layers (24-31 in OLMo-2) responsible for this behavior, practitioners can build 'Squish and Release' architectures that surface internal dissent as explicit safety signals rather than relying on unreliable output inspection.

From the abstract

Language models detect false premises when asked directly but absorb them under conversational pressure, producing authoritative professional output built on errors they already identified. This failure - order-gap hallucination - is invisible to output inspection because the error migrates into the activation space of the safety circuit, suppressed but not erased. We introduce Squish and Release (S&R), an activation-patching architecture with two components: a fixed detector body (layers 24-31,