Harmful intent in LLMs can be detected geometrically even after safety 'refusal' mechanisms have been surgically removed.
March 31, 2026
Original Paper
The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams
arXiv · 2603.27412
The Takeaway
This paper demonstrates a geometric dissociation between intent representation and refusal behavior, showing that 'abliterated' models still encode harmful intent in a tight angular distribution. It provides a training-free, high-precision method (AUROC 0.93+) for detecting malicious queries that bypasses current alignment-based defenses.
From the abstract
We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle $\theta$ from this reference direction. The anomaly score is the negative log-likelihood of $\theta$ under a Gaussian fit to the normative distribution.
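The scoring procedure in the abstract can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the activation matrix here is a random stand-in, where in practice it would be residual-stream activations extracted from the LLM at the target layer, and the hidden size (64) and layer choice are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the activations of 200 safe normative prompts at the target
# layer (hypothetical hidden size 64; real activations would come from the model).
normative = rng.normal(size=(200, 64))

# Leading principal component of the centered normative activations,
# used as the reference direction.
centered = normative - normative.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]  # unit-norm right singular vector

def deviation_angle(x, direction=pc1):
    """Radial deviation angle theta (radians) between an activation
    vector and the reference direction (sign-invariant)."""
    cos = abs(x @ direction) / np.linalg.norm(x)
    return float(np.arccos(np.clip(cos, 0.0, 1.0)))

# Fit a Gaussian to the angles of the normative prompts.
angles = np.array([deviation_angle(a) for a in normative])
mu, sigma = angles.mean(), angles.std()

def anomaly_score(x):
    """Negative log-likelihood of theta under the normative Gaussian fit;
    higher means more anomalous."""
    theta = deviation_angle(x)
    return 0.5 * np.log(2 * np.pi * sigma**2) + (theta - mu) ** 2 / (2 * sigma**2)
```

A prompt whose activation lies at an unusual angle to the reference direction (e.g. one aligned with `pc1` itself) receives a much higher score than the normative prompts, which cluster tightly around the fitted mean angle.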