Identifies functionally complete safety circuits in LLMs via differentiable binary masks, enabling near-surgical removal of backdoors and jailbreak vulnerabilities.
March 25, 2026
Original Paper
SafeSeek: Universal Attribution of Safety Circuits in Language Models
arXiv · 2603.23268
The Takeaway
SafeSeek moves safety work from broad fine-tuning to precise circuit attribution. This lets practitioners eradicate backdoor triggers (attack success drops from 100% to 0.4%) or fix alignment issues by targeting less than 1% of the model's parameters.
From the abstract
Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits…
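To make the idea of differentiable binary masks concrete, here is a minimal PyTorch sketch of the general technique: a learnable gate per attention head, binarized with a straight-through estimator, trained so that ablating a small set of heads suppresses an unsafe behavior. All names and the loss structure are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (illustrative, not the paper's code): differentiable binary
# masks over attention heads for circuit attribution.
import torch
import torch.nn as nn

class HeadMask(nn.Module):
    """One learnable gate per attention head, relaxed to (0, 1) by a sigmoid
    and binarized in the forward pass with a straight-through estimator."""
    def __init__(self, n_heads: int, init_logit: float = 2.0):
        super().__init__()
        # Positive init keeps all heads "on" at the start of training.
        self.logits = nn.Parameter(torch.full((n_heads,), init_logit))

    def forward(self) -> torch.Tensor:
        soft = torch.sigmoid(self.logits)      # relaxed mask in (0, 1)
        hard = (soft > 0.5).float()            # binarized mask in {0, 1}
        # Straight-through: forward uses the hard mask, gradients flow via soft.
        return hard + (soft - soft.detach())

def mask_objective(unsafe_loss: torch.Tensor, mask: torch.Tensor,
                   lam: float = 1e-2) -> torch.Tensor:
    """unsafe_loss is assumed to measure residual unsafe behavior (e.g., how
    often a backdoor trigger still fires) with the masked heads ablated; the
    sparsity term penalizes ablating many heads, keeping the edit small."""
    sparsity = (1.0 - mask).sum()
    return unsafe_loss + lam * sparsity

# Example: scale per-head outputs by the mask (e.g., inside a forward hook).
mask_module = HeadMask(n_heads=32)
head_out = torch.randn(1, 32, 128)             # (batch, heads, d_head), dummy
masked_out = head_out * mask_module().view(1, -1, 1)
```

Only the mask logits are trained here; the model weights stay frozen, which is what allows the final "circuit" to be read off as the set of heads whose gates went to zero.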