AI & ML Scaling Insight

Adversarial prompt injection shifts jailbreak success rates from polynomial to exponential scaling in the number of inference-time samples.

arXiv · March 13, 2026 · 2603.11331

Indranil Halder, Annesya Banerjee, Cengiz Pehlevan

Why it matters

The paper uses spin-glass theory to explain why long jailbreak prompts create an 'ordered phase' in the model's output distribution. The resulting shift from polynomial to exponential growth in attack success rate suggests that safety alignment is far more vulnerable to large-scale inference-time sampling than the slow growth seen without injection would imply.
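
To make the polynomial-versus-exponential distinction concrete, here is a toy sketch. It is not the paper's spin-glass model, and every name and parameter in it is hypothetical. It assumes an attack counts as successful if at least one of k independent samples is unsafe, and it compares how fast the failure rate (the chance that all k samples stay safe) shrinks in two cases: every prompt shares one per-sample success probability p, or p varies across prompts following a Beta(a, b) distribution with substantial mass near zero. Under these assumptions the first case decays exponentially in k, while the second decays roughly like k^(-a), a power law.

```python
import math


def failure_rate_fixed(p: float, k: int) -> float:
    """Chance that all k independent samples fail when each succeeds with probability p."""
    return (1.0 - p) ** k


def failure_rate_beta(a: float, b: float, k: int) -> float:
    """E[(1 - p)^k] for p ~ Beta(a, b), which equals B(a, b + k) / B(a, b).

    Computed through log-gamma so that large k does not overflow.
    """
    log_ratio = (
        math.lgamma(a + b) + math.lgamma(b + k)
        - math.lgamma(b) - math.lgamma(a + b + k)
    )
    return math.exp(log_ratio)


def local_exponent(failure_rate, k: int) -> float:
    """Slope of log(failure rate) against log(k), measured between k and 2k.

    A power law gives a roughly constant slope; exponential decay gives a
    slope whose magnitude keeps growing with k.
    """
    return (math.log(failure_rate(2 * k)) - math.log(failure_rate(k))) / math.log(2.0)


if __name__ == "__main__":
    # Illustrative parameter choices only (assumptions, not values from the paper).
    for k in (10, 100, 1000):
        fixed = local_exponent(lambda n: failure_rate_fixed(0.05, n), k)
        mixed = local_exponent(lambda n: failure_rate_beta(0.5, 1.0, n), k)
        print(f"k={k:5d}  fixed-p slope={fixed:9.2f}  Beta-mixture slope={mixed:6.3f}")
```

In this toy setting, the attack success rate is one minus the failure rate, so the fixed-p case saturates after a modest number of samples while the heterogeneous case keeps improving slowly. Whether the paper's mechanism maps onto this picture is not something the truncated abstract settles.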

From the abstract

Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. To explain this phenomenon, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where genera