Harmful intent in LLMs can be detected geometrically even after safety 'refusal' mechanisms have been surgically removed.
March 31, 2026
Original Paper
The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams
arXiv · 2603.27412
The Takeaway
This paper demonstrates a geometric dissociation between intent representation and refusal behavior, showing that 'abliterated' models still encode harmful intent in a tight angular distribution. It provides a training-free, high-precision method (AUROC 0.93+) for detecting malicious queries that bypasses current alignment-based defenses.
From the abstract
We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle $\theta$ from this reference direction. The anomaly score is the negative log-likelihood of $\theta$ under a Gaussian fit to the normative distribution.
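The scoring procedure in the abstract can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the activation matrix here is a random stand-in, where in practice it would be residual-stream activations extracted from the LLM at the target layer, and the hidden size (64) and layer choice are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the activations of 200 safe normative prompts at the target
# layer (hypothetical hidden size 64; real activations would come from the model).
normative = rng.normal(size=(200, 64))

# Leading principal component of the centered normative activations,
# used as the reference direction.
centered = normative - normative.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]  # unit-norm right singular vector

def deviation_angle(x, direction=pc1):
    """Radial deviation angle theta (radians) between an activation
    vector and the reference direction (sign-invariant)."""
    cos = abs(x @ direction) / np.linalg.norm(x)
    return float(np.arccos(np.clip(cos, 0.0, 1.0)))

# Fit a Gaussian to the angles of the normative prompts.
angles = np.array([deviation_angle(a) for a in normative])
mu, sigma = angles.mean(), angles.std()

def anomaly_score(x):
    """Negative log-likelihood of theta under the normative Gaussian fit;
    higher means more anomalous."""
    theta = deviation_angle(x)
    return 0.5 * np.log(2 * np.pi * sigma**2) + (theta - mu) ** 2 / (2 * sigma**2)
```

A prompt whose activation lies at an unusual angle to the reference direction (e.g. one aligned with `pc1` itself) receives a much higher score than the normative prompts, which cluster tightly around the fitted mean angle.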