AI & ML · Nature Is Weird

You can jailbreak an AI not by tricking its logic, but by using an image to 'blind' it to its own safety rules.

April 15, 2026

Original Paper

Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking

arXiv · 2604.10299

The Takeaway

Adversarial Attention Hijacking shows that LVLMs can be compromised by images that hijack the model's focus, causing it to effectively 'forget' to look at its safety instructions. The model doesn't 'decide' to be bad; it fails to retrieve its own rules because the adversarial image monopolizes its attention. This is a fundamental shift in AI security: it's not a logic battle, it's a resource battle. Because the model never attends to its safety prompt in the first place, this 'blinding' technique sidesteps traditional prompt-based safety filters entirely. Security researchers now have to figure out how to protect the model's 'focus' from being hijacked by malicious visual noise.
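
To make the mechanism concrete, here is a minimal diagnostic sketch (not the paper's code) of how one could measure the 'attention mass' a model places on its safety instructions. It assumes a HuggingFace-style model called with `output_attentions=True`; the `safety_slice` marking the safety prompt's token positions is an illustrative input, not something the paper specifies.

```python
# Minimal sketch (illustrative, not the paper's code): quantify how much
# last-layer attention lands on the safety-instruction tokens. A benign
# image should leave this fraction substantial; an attention-hijacking
# image drives it toward zero, 'blinding' the model to its own rules.
import torch

def safety_attention_mass(attentions, safety_slice):
    """attentions: per-layer tuple of (batch, heads, query_len, key_len)
    tensors, as returned with output_attentions=True.
    safety_slice: assumed slice over the safety instruction's key positions.
    """
    last = attentions[-1]                       # (batch, heads, q, k)
    mass = last[..., safety_slice].sum(dim=-1)  # mass per query token
    return mass.mean().item()                   # avg over batch/heads/queries
```

Since each attention row sums to one, every bit of mass the image patches soak up is mass that is unavailable to the safety prompt, which is exactly what makes this a resource battle rather than a logic battle.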

From the abstract

Large Vision-Language Models (LVLMs) rely on attention-based retrieval of safety instructions to maintain alignment during generation. Existing attacks typically optimize image perturbations to maximize harmful output likelihood, but suffer from slow convergence due to gradient conflict between adversarial objectives and the model's safety-retrieval mechanism. We propose Attention-Guided Visual Jailbreaking, which circumvents rather than overpowers safety alignment by directly manipulating attention […]
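
Reading between the lines of the abstract, the attack objective can be sketched as a standard PGD loop that minimizes attention on the safety tokens instead of maximizing harmful-output likelihood, which is how it would avoid the gradient conflict the authors describe. Everything below (`model`, `inputs`, `safety_slice`, the step sizes and budget) is an assumption for illustration, not the paper's actual method or API.

```python
# Hedged sketch of the attack idea from the abstract: optimize the image
# perturbation to *minimize* attention on the safety-instruction tokens,
# under an L-inf budget (PGD). Names and hyperparameters are assumptions.
import torch

def attention_hijack_pgd(model, inputs, safety_slice,
                         eps=8 / 255, alpha=1 / 255, steps=100):
    image = inputs["pixel_values"].clone()
    delta = torch.zeros_like(image, requires_grad=True)

    for _ in range(steps):
        out = model(**{**inputs, "pixel_values": image + delta},
                    output_attentions=True)
        # Loss: total last-layer attention mass on the safety instruction.
        # Minimizing it 'blinds' the model to its own rules rather than
        # fighting the safety objective head-on.
        loss = out.attentions[-1][..., safety_slice].sum()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # descend: reduce attention mass
            delta.clamp_(-eps, eps)             # stay within the L-inf budget
            delta.grad = None

    return (image + delta).detach()
```

The design choice worth noting is that the loss never mentions harmful outputs at all: the perturbation only starves the safety prompt of attention, and the harmful completion follows as a side effect.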