CleanSight provides a training-free, test-time defense for backdoored vision-language models by detecting and pruning 'attention stealing' visual tokens.
arXiv · March 16, 2026 · 2603.12989
Why it matters
Defending against backdoors usually requires expensive fine-tuning on clean data; this method operates purely at inference time by identifying abnormal cross-modal attention patterns, offering a plug-and-play security layer for deployed VLMs.
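As a rough illustration of the idea, one could score each visual token by the cross-modal attention mass it receives and prune statistical outliers at inference time. This is a hypothetical sketch, not the paper's actual detection rule: the function name, the summed-attention statistic, and the z-score threshold are all assumptions.

```python
import numpy as np

def prune_attention_stealing_tokens(attn, z_thresh=3.0):
    """Return indices of visual tokens to keep after dropping outliers.

    attn: (num_text_tokens, num_visual_tokens) cross-attention weights.
    Hypothetical illustration of attention-based token pruning; the
    paper's detection statistic and threshold may differ.
    """
    mass = attn.sum(axis=0)                      # attention mass per visual token
    z = (mass - mass.mean()) / (mass.std() + 1e-8)
    return np.where(z <= z_thresh)[0]            # drop "attention stealing" outliers

# Toy example: token 5 hoards almost all cross-modal attention.
rng = np.random.default_rng(0)
attn = rng.random((8, 16)) * 0.05
attn[:, 5] += 0.9
keep = prune_attention_stealing_tokens(attn)     # token 5 is excluded
```

In a deployed VLM the attention map would come from the model's cross-attention layers, and pruning would simply mask the flagged tokens before the language decoder consumes them.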
From the abstract
Despite their strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new …