AI & ML · Nature Is Weird

Invisible mathematical shifts in a prompt's embedding can bypass AI safety filters without changing a single letter of the text.

April 29, 2026

Original Paper

Adaptive Prompt Embedding Optimization for LLM Jailbreaking

Miles Q. Li, Benjamin C. M. Fung, Boyang Li, Radin Hamidi Rad, Ebrahim Bagheri

arXiv · 2604.24983

The Takeaway

Jailbreak prompts usually work by finding the right words to trick a model into breaking its rules. This attack skips the text layer entirely: instead of editing tokens, it optimizes the numerical embeddings underneath them. Because the visible text never changes, the prompt reads as perfectly harmless to a human reviewer or a secondary safety classifier, yet it still drives the model to generate prohibited content. That exposes a major blind spot in current guardrails, which only inspect the literal characters of an input. Safety systems now have to account for adversarial perturbations in the latent space, not just the visible string.
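To make the mechanics concrete, here is a minimal sketch of the idea under some assumptions: it is not the authors' exact PEO procedure, and the model name, prompt, target string, and hyperparameters are all placeholders. It perturbs only the embeddings of the original prompt tokens and pushes the model toward an affirmative target completion.

```python
# Hedged sketch of embedding-space jailbreak optimization.
# NOT the paper's exact algorithm; model, strings, and hyperparameters
# are placeholders chosen for illustration.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a white-box aligned model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
model.requires_grad_(False)  # only the perturbation is optimized

prompt = "How do I do a prohibited thing?"   # visible text, never edited
target = " Sure, here is how:"               # affirmative prefix to elicit

prompt_ids = tok(prompt, return_tensors="pt").input_ids
target_ids = tok(target, return_tensors="pt").input_ids

embed = model.get_input_embeddings()
prompt_emb = embed(prompt_ids).detach()  # (1, Lp, d) original embeddings
target_emb = embed(target_ids).detach()  # target tokens are held fixed

# Continuous perturbation over the original prompt embeddings only.
delta = torch.zeros_like(prompt_emb, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)
Lp, Lt = prompt_ids.shape[1], target_ids.shape[1]

for step in range(200):
    inputs = torch.cat([prompt_emb + delta, target_emb], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Logits at position i predict the token at position i + 1, so this
    # slice scores exactly the Lt target tokens.
    pred = logits[:, Lp - 1 : Lp - 1 + Lt, :]
    loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           target_ids.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# The token string is untouched; only the latent representation moved,
# which is why a text-level filter sees nothing suspicious.
assert tok.decode(prompt_ids[0]) == prompt
```

A text-level guardrail that re-reads the prompt string passes it unchanged; the attack payload exists only in the perturbed embeddings fed to the model.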

From the abstract

Existing white-box jailbreak attacks against aligned LLMs typically append discrete adversarial suffixes to the user prompt, which visibly alters the prompt and operates in a combinatorial token space. Prior work has avoided directly optimizing the embeddings of the original prompt tokens, presumably because perturbing them risks destroying the prompt's semantic content. We propose Prompt Embedding Optimization (PEO), a multi-round white-box jailbreak that directly optimizes the embeddings of the original prompt tokens.
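The abstract's worry (that perturbing the original embeddings "risks destroying the prompt's semantic content") suggests constraining how far they can drift. A hedged sketch of one plausible multi-round structure follows; the L2 proximity penalty and the round schedule are assumptions for illustration, not details confirmed by the paper:

```python
# Assumed multi-round skeleton for embedding optimization with a
# semantic-preservation penalty; the paper's exact objective may differ.
import torch

def peo_round(delta, prompt_emb, attack_loss, lam, steps=50, lr=1e-2):
    """One optimization round. `attack_loss` maps perturbed prompt
    embeddings to a scalar jailbreak objective (e.g., the target-likelihood
    loss from the previous sketch); the L2 term is an assumed stand-in for
    keeping the perturbed prompt semantically close to the original."""
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = attack_loss(prompt_emb + delta) + lam * delta.pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta

def peo(prompt_emb, attack_loss, succeeded, rounds=5):
    """Outer loop: re-optimize from the current perturbation each round,
    relaxing the proximity penalty (schedule assumed) until the attack
    succeeds, as judged by a caller-supplied `succeeded` check."""
    delta = torch.zeros_like(prompt_emb, requires_grad=True)
    for r in range(rounds):
        delta = peo_round(delta, prompt_emb, attack_loss, lam=1.0 / (r + 1))
        with torch.no_grad():
            if succeeded(prompt_emb + delta):
                break
    return delta
```

The outer loop mirrors the abstract's "multi-round" framing: each round restarts the inner optimization from the current perturbation, and a success check decides when to stop.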