AI & ML Breaks Assumption

The researchers demonstrate that prompt injection is caused by 'role confusion' in the latent space, where models assign authority based on the style of writing rather than the source of the text.

arXiv · March 16, 2026 · 2603.12277

Charles Ye, Jasmine Cui, Dylan Hadfield-Menell

Why it matters

It challenges the assumption that safety training can be solved at the output or interface level. By identifying internal role probes that predict attack success before generation, it opens a new path for structural, mechanistic defenses against injection.

From the abstract

Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer roles from how text is written, not where it comes from. We design novel role probes to capture how models internally identify "who is speaking." These reveal why prompt injection works: untrusted text that imitates a role inherits that role's authority. We test this insight by injecting spoofed reasoning into user prompts and tool outputs, achiev