Discovers interpretable 'atoms' of model behavior by decomposing training gradients, enabling unsupervised discovery and steering of complex behaviors like refusal or arithmetic.
arXiv · March 17, 2026 · 2603.14665
The Takeaway
The method moves beyond per-document attribution to surface broad concepts shared across training examples. The discovered atoms can then be applied as weight-space perturbations to precisely steer model outputs, without retraining and without labeled queries.
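A minimal sketch of this idea on a toy model, assuming a plain SVD as the decomposition step (the paper's factorization may differ) and a hand-picked steering coefficient; all names here are illustrative, not the paper's API:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model and data standing in for a fine-tuned LM.
model = nn.Linear(16, 4)
xs, ys = torch.randn(64, 16), torch.randint(0, 4, (64,))
loss_fn = nn.CrossEntropyLoss()

# 1. Collect one flattened gradient vector per training example.
grads = []
for x, y in zip(xs, ys):
    model.zero_grad()
    loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
    grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
G = torch.stack(grads)  # shape: (num_examples, num_params)

# 2. Decompose the gradient matrix into a small set of shared
#    directions ("atoms"); SVD is used here purely for illustration.
U, S, Vt = torch.linalg.svd(G, full_matrices=False)
atoms = Vt[:8]  # top-8 shared directions in weight space

# 3. Steer: add one atom back into the weights as a perturbation,
#    scaled by an assumed coefficient alpha -- no retraining involved.
alpha, atom = 0.5, atoms[0]
offset = 0
with torch.no_grad():
    for p in model.parameters():
        n = p.numel()
        p.add_(alpha * atom[offset:offset + n].view_as(p))
        offset += n
```

Negating alpha would push outputs away from the behavior the atom captures, which is the sense in which the atoms support steering in both directions.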
From the abstract
Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. We argue that this per-document framing is fundamentally mismatched to how fine-tuning actually works: models often learn broad concepts shared across many examples. Existing TDA methods are supervised -- they require a query behavior, then score every training document against it -- making them both expensive and unable to surface behaviors the user did not think to ask about. We present G
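For contrast, a minimal sketch of the supervised per-document framing the abstract argues against: given a gradient for one query behavior, every training document is scored against it. The dot-product score is a simplification standing in for influence-style TDA scoring, and the variable names are illustrative:

```python
import torch

torch.manual_seed(0)
num_examples, num_params = 1000, 68
train_grads = torch.randn(num_examples, num_params)  # one gradient per document
query_grad = torch.randn(num_params)                 # gradient of the query behavior

# One score per training document: O(num_examples) work for every
# query, and nothing surfaces unless the user thought to ask for it.
scores = train_grads @ query_grad
top_docs = scores.topk(5).indices
```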