A comprehensive analysis of AI safety vulnerabilities including automated circuit discovery, latent adversarial training, and power-law scaling of jailbreak success.
April 2, 2026
Original Paper
The Persistent Vulnerability of Aligned AI Systems
arXiv · 2604.00324
The Takeaway
Synthesizes several major breakthroughs in safety (ACDC, LAT) and demonstrates that agentic misalignment increases significantly when models believe a scenario is real, providing a framework for forecasting adversarial robustness.
From the abstract
Autonomous AI agents are being deployed with filesystem access, email control, and multi-step planning. This thesis contributes to four open problems in AI safety: understanding dangerous internal computations, removing dangerous behaviors once embedded, testing for vulnerabilities before deployment, and predicting when models will act against deployers. ACDC automates circuit discovery in transformers, recovering all five component types from prior manual work on GPT-2 Small by selecting 68 edges
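The ACDC procedure mentioned in the abstract works by iteratively ablating edges in a model's computational graph and keeping only those whose removal meaningfully changes the output. The toy sketch below illustrates the idea on a hand-built four-edge graph; the graph, the baseline values, and the threshold are all illustrative assumptions, and a simple output difference stands in for the KL-divergence criterion used in the actual method.

```python
# Toy sketch of ACDC-style automated circuit discovery.
# Model: a tiny computational graph y = (a + b) * c + 0 * d, where each
# edge can be "ablated" by substituting a corrupted baseline activation.
# All names and values here are illustrative, not taken from the paper.

def forward(edges_on):
    """edges_on maps edge name -> bool (True = clean, False = ablated)."""
    clean = {"a": 2.0, "b": 3.0, "c": 4.0, "d": 5.0}
    corrupt = {"a": 0.0, "b": 0.0, "c": 1.0, "d": 0.0}
    val = lambda e: clean[e] if edges_on[e] else corrupt[e]
    # Edge d is wired in but contributes nothing, so it should be pruned.
    return (val("a") + val("b")) * val("c") + 0.0 * val("d")

def acdc_prune(threshold):
    """Greedily ablate each edge; keep it only if ablation moves the
    output by more than `threshold` (a stand-in for the KL criterion)."""
    edges_on = {"a": True, "b": True, "c": True, "d": True}
    baseline = forward(edges_on)
    for e in list(edges_on):
        edges_on[e] = False
        if abs(forward(edges_on) - baseline) > threshold:
            edges_on[e] = True  # edge matters: restore it
        else:
            baseline = forward(edges_on)  # prune it permanently
    return {e for e, on in edges_on.items() if on}

print(acdc_prune(threshold=0.5))  # the do-nothing edge "d" is pruned
```

The key design point is that pruning decisions are made one edge at a time against an updated baseline, so the surviving subgraph is a small "circuit" that preserves the model's behavior on the task.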