Discovers interpretable 'atoms' of model behavior by decomposing training gradients, enabling unsupervised discovery and steering of complex behaviors like refusal or arithmetic.
arXiv · March 17, 2026 · 2603.14665
The Takeaway
The method moves beyond per-document attribution to surface broad concepts shared across training examples. The discovered atoms can then be applied as weight-space perturbations to precisely steer model outputs, without retraining and without labeled queries.
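A minimal sketch of this idea on a toy model, assuming a plain SVD as the decomposition step (the paper's factorization may differ) and a hand-picked steering coefficient; all names here are illustrative, not the paper's API:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model and data standing in for a fine-tuned LM.
model = nn.Linear(16, 4)
xs, ys = torch.randn(64, 16), torch.randint(0, 4, (64,))
loss_fn = nn.CrossEntropyLoss()

# 1. Collect one flattened gradient vector per training example.
grads = []
for x, y in zip(xs, ys):
    model.zero_grad()
    loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
    grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
G = torch.stack(grads)  # shape: (num_examples, num_params)

# 2. Decompose the gradient matrix into a small set of shared
#    directions ("atoms"); SVD is used here purely for illustration.
U, S, Vt = torch.linalg.svd(G, full_matrices=False)
atoms = Vt[:8]  # top-8 shared directions in weight space

# 3. Steer: add one atom back into the weights as a perturbation,
#    scaled by an assumed coefficient alpha -- no retraining involved.
alpha, atom = 0.5, atoms[0]
offset = 0
with torch.no_grad():
    for p in model.parameters():
        n = p.numel()
        p.add_(alpha * atom[offset:offset + n].view_as(p))
        offset += n
```

Negating alpha would push outputs away from the behavior the atom captures, which is the sense in which the atoms support steering in both directions.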
From the abstract
Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. We argue that this per-document framing is fundamentally mismatched to how fine-tuning actually works: models often learn broad concepts shared across many examples. Existing TDA methods are supervised -- they require a query behavior, then score every training document against it -- making them both expensive and unable to surface behaviors the user did not think to ask about. We present G
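For contrast, a minimal sketch of the supervised per-document framing the abstract argues against: given a gradient for one query behavior, every training document is scored against it. The dot-product score is a simplification standing in for influence-style TDA scoring, and the variable names are illustrative:

```python
import torch

torch.manual_seed(0)
num_examples, num_params = 1000, 68
train_grads = torch.randn(num_examples, num_params)  # one gradient per document
query_grad = torch.randn(num_params)                 # gradient of the query behavior

# One score per training document: O(num_examples) work for every
# query, and nothing surfaces unless the user thought to ask for it.
scores = train_grads @ query_grad
top_docs = scores.topk(5).indices
```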