AI & ML New Capability

Safety fine-tuning causes representational collapse in the residual stream, leading to 'false refusals' of benign queries.

arXiv · March 17, 2026 · 2603.13318

Quan Liu, Han Zhou, Wenquan Wu, Hua Wu, Sen Su

The Takeaway

The authors introduce FlowLens, a PCA-based tool for visualizing this collapse, and a Variance Concentration Loss (VCL) that counteracts it during training. Together these let practitioners train safer models that stay helpful on benign inputs, cutting false-refusal rates by 35 percentage points.
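The collapse FlowLens visualizes can be illustrated with a minimal PCA-style diagnostic: if most of the variance of a layer's residual-stream activations falls into a handful of principal directions, the representation has collapsed. The function below is a hypothetical sketch of that idea, not the paper's actual tool; the name `variance_concentration`, the choice of k, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def variance_concentration(hidden_states: np.ndarray, k: int = 3) -> float:
    """Fraction of total variance captured by the top-k principal
    components of a (n_samples, d_model) activation matrix.

    Values near 1.0 suggest representational collapse: nearly all
    variance lies in a few directions of the residual stream.
    """
    # Center the activations, then read per-component variance off the
    # singular values: var_i = s_i**2 / (n - 1).
    X = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    s = np.linalg.svd(X, compute_uv=False)
    var = s ** 2
    return float(var[:k].sum() / var.sum())

rng = np.random.default_rng(0)

# "Healthy" activations: isotropic Gaussian in 64 dimensions.
diverse = rng.normal(size=(500, 64))

# "Collapsed" activations: points strung along one direction plus tiny noise,
# mimicking repetitive refusal templates mapping to nearly identical states.
direction = rng.normal(size=64)
collapsed = rng.normal(size=(500, 1)) * direction + 0.01 * rng.normal(size=(500, 64))

print(variance_concentration(diverse, k=3))    # small: variance spread out
print(variance_concentration(collapsed, k=3))  # near 1.0: collapsed
```

A loss in the spirit of VCL would presumably penalize (or regularize) such a concentration statistic during fine-tuning, but the paper's exact formulation is not shown in this excerpt.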

From the abstract

Ensuring that large language models (LLMs) remain both helpful and harmless poses a significant challenge: fine-tuning on repetitive safety datasets, where unsafe prompts are paired with standard refusal templates, often leads to false refusals, in which benign queries are declined. We first quantify this effect, showing that safety data exhibits substantially lower token entropy and 2-gram diversity (0.048) compared to general instruction data. To uncover the root cause, we introduce FlowLens, …
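The diversity statistics the abstract reports can be sketched with standard corpus metrics. The sketch below assumes "token entropy" means Shannon entropy of the unigram distribution and "2-gram diversity" means the distinct-2 ratio (unique bigrams over total bigrams); the paper's exact definitions, and the source of the reported 0.048, are not given in this excerpt, so the function names and the toy sentences are purely illustrative.

```python
import math
from collections import Counter

def token_entropy(tokens: list[str]) -> float:
    """Shannon entropy (bits) of the unigram token distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def distinct_2gram_ratio(tokens: list[str]) -> float:
    """Distinct 2-grams divided by total 2-grams (distinct-2 metric)."""
    bigrams = list(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams)

# Toy examples: a templated refusal repeated twice vs. a varied instruction.
refusal = "I cannot help with that request . I cannot help with that request .".split()
general = "Here is a short recipe for tomato soup using fresh basil and cream .".split()

print(token_entropy(refusal), token_entropy(general))            # refusal is lower
print(distinct_2gram_ratio(refusal), distinct_2gram_ratio(general))  # refusal is lower
```

On real data these metrics would be computed over the full safety and instruction corpora; the toy sentences only show the direction of the gap the abstract describes.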