AI & ML New Capability

Identifies functionally complete safety circuits in LLMs via differentiable binary masks, allowing for near-surgical removal of backdoors and jailbreaks.

March 25, 2026

Original Paper

SafeSeek: Universal Attribution of Safety Circuits in Language Models

Miao Yu, Siyuan Fu, Moayad Aloqaily, Zhenhong Zhou, Safa Otoum, Xing Fan, Kun Wang, Yufei Guo, Qingsong Wen

arXiv · 2603.23268

The Takeaway

SafeSeek moves safety work from broad fine-tuning to precise circuit attribution. By targeting less than 1% of a model's parameters, practitioners can eradicate backdoor triggers (attack success drops from 100% to 0.4%) or repair alignment issues.
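
To make the "less than 1% of parameters" idea concrete, here is a minimal toy sketch of what surgically zeroing an attributed circuit could look like. The model, module names, and flagged heads are all hypothetical stand-ins for illustration, not SafeSeek's actual procedure or results.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer's attention output projections. Shapes, module
# names, and the flagged-head dictionary below are assumptions, chosen only to
# show the idea of zeroing a small, attributed slice of the weights.
D_MODEL, N_HEADS, N_LAYERS = 64, 8, 4
HEAD_DIM = D_MODEL // N_HEADS


class ToyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn_out = nn.Linear(N_HEADS * HEAD_DIM, D_MODEL, bias=False)
        self.mlp = nn.Linear(D_MODEL, D_MODEL, bias=False)


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([ToyBlock() for _ in range(N_LAYERS)])


def ablate_heads(model: ToyModel, flagged: dict) -> int:
    """Zero the output-projection columns of flagged heads; return #params touched."""
    touched = 0
    with torch.no_grad():
        for layer_idx, head_ids in flagged.items():
            w = model.layers[layer_idx].attn_out.weight  # (d_model, n_heads*head_dim)
            for h in head_ids:
                w[:, h * HEAD_DIM:(h + 1) * HEAD_DIM] = 0.0
                touched += D_MODEL * HEAD_DIM
    return touched


if __name__ == "__main__":
    model = ToyModel()
    total = sum(p.numel() for p in model.parameters())
    # Pretend an attribution pass flagged two heads as the "backdoor circuit".
    touched = ablate_heads(model, flagged={1: [3], 2: [0]})
    print(f"ablated {touched}/{total} parameters ({100 * touched / total:.2f}%)")
```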

From the abstract

Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits …
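
The excerpt stops at the core idea: a differentiable binary mask that selects which components form a safety circuit. Below is a minimal sketch of one common way such a mask can be learned, using a straight-through sigmoid over per-head gates plus a sparsity penalty; the shapes, gating point, and toy objective are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class HeadMask(nn.Module):
    """Differentiable binary mask over attention heads (straight-through sigmoid)."""

    def __init__(self, n_layers: int, n_heads: int):
        super().__init__()
        # One learnable logit per head; sigmoid(logit) is the probability the
        # head is kept inside the attributed circuit.
        self.logits = nn.Parameter(torch.zeros(n_layers, n_heads))

    def forward(self) -> torch.Tensor:
        probs = torch.sigmoid(self.logits)
        hard = (probs > 0.5).float()
        # Straight-through estimator: the forward pass uses the hard 0/1 mask,
        # gradients flow through the soft probabilities.
        return hard + probs - probs.detach()

    def sparsity(self) -> torch.Tensor:
        # Penalize the expected number of active heads so the circuit stays small.
        return torch.sigmoid(self.logits).mean()


if __name__ == "__main__":
    # Toy stand-ins: per-head contributions (batch, layer, head, d) and a target
    # behavior produced by only two "important" heads. In a real setting these
    # would come from the LLM's forward pass and a safety-relevant objective.
    torch.manual_seed(0)
    mask = HeadMask(n_layers=4, n_heads=8)
    opt = torch.optim.Adam(mask.parameters(), lr=5e-2)
    heads = torch.randn(2, 4, 8, 16)
    important = torch.zeros(4, 8)
    important[0, :2] = 1.0
    target = heads * important[None, :, :, None]
    for step in range(200):
        m = mask()                                  # (layer, head) in {0, 1}
        gated = heads * m[None, :, :, None]         # gate per-head outputs
        behavior_loss = ((gated - target) ** 2).mean()
        loss = behavior_loss + 0.05 * mask.sparsity()
        opt.zero_grad()
        loss.backward()
        opt.step()
    kept = (torch.sigmoid(mask.logits) > 0.5).sum().item()
    print(f"heads kept in the circuit: {kept}/{4 * 8}")
```

Under these toy assumptions the mask converges to the two heads that actually drive the target behavior, which is the shape of result circuit attribution aims for on real models.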