A comprehensive analysis of AI safety vulnerabilities including automated circuit discovery, latent adversarial training, and power-law scaling of jailbreak success.
April 2, 2026
Original Paper
The Persistent Vulnerability of Aligned AI Systems
arXiv · 2604.00324
The Takeaway
Synthesizes several major breakthroughs in safety (ACDC, LAT) and demonstrates that agentic misalignment increases significantly when models believe a scenario is real, providing a framework for forecasting adversarial robustness.
From the abstract
Autonomous AI agents are being deployed with filesystem access, email control, and multi-step planning. This thesis contributes to four open problems in AI safety: understanding dangerous internal computations, removing dangerous behaviors once embedded, testing for vulnerabilities before deployment, and predicting when models will act against deployers. ACDC automates circuit discovery in transformers, recovering all five component types from prior manual work on GPT-2 Small by selecting 68 edges
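The ACDC procedure mentioned in the abstract works by iteratively ablating edges in a model's computational graph and keeping only those whose removal meaningfully changes the output. The toy sketch below illustrates the idea on a hand-built four-edge graph; the graph, the baseline values, and the threshold are all illustrative assumptions, and a simple output difference stands in for the KL-divergence criterion used in the actual method.

```python
# Toy sketch of ACDC-style automated circuit discovery.
# Model: a tiny computational graph y = (a + b) * c + 0 * d, where each
# edge can be "ablated" by substituting a corrupted baseline activation.
# All names and values here are illustrative, not taken from the paper.

def forward(edges_on):
    """edges_on maps edge name -> bool (True = clean, False = ablated)."""
    clean = {"a": 2.0, "b": 3.0, "c": 4.0, "d": 5.0}
    corrupt = {"a": 0.0, "b": 0.0, "c": 1.0, "d": 0.0}
    val = lambda e: clean[e] if edges_on[e] else corrupt[e]
    # Edge d is wired in but contributes nothing, so it should be pruned.
    return (val("a") + val("b")) * val("c") + 0.0 * val("d")

def acdc_prune(threshold):
    """Greedily ablate each edge; keep it only if ablation moves the
    output by more than `threshold` (a stand-in for the KL criterion)."""
    edges_on = {"a": True, "b": True, "c": True, "d": True}
    baseline = forward(edges_on)
    for e in list(edges_on):
        edges_on[e] = False
        if abs(forward(edges_on) - baseline) > threshold:
            edges_on[e] = True  # edge matters: restore it
        else:
            baseline = forward(edges_on)  # prune it permanently
    return {e for e, on in edges_on.items() if on}

print(acdc_prune(threshold=0.5))  # the do-nothing edge "d" is pruned
```

The key design point is that pruning decisions are made one edge at a time against an updated baseline, so the surviving subgraph is a small "circuit" that preserves the model's behavior on the task.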