AI & ML Paradigm Shift

DSPA performs preference alignment at inference time by steering Sparse Autoencoder (SAE) features, bypassing the need for expensive weight-update training.

March 24, 2026

Original Paper

DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment

James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith

arXiv · 2603.21461

The Takeaway

DSPA enables prompt-conditional alignment with 4.47x fewer FLOPs than traditional fine-tuning. This lets practitioners apply preference alignment (like RLHF) as a modular, interpretable inference-time intervention rather than a permanent model update.

From the abstract

Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility. We propose Dynamic SAE Steering for Preference Alignment (DSPA), an inference-time method that makes sparse autoencoder (SAE) steering prompt-conditional. From preference triples, DSPA computes a conditional-difference map linking prompt features to generation-control features; during decoding, it modifies only token-
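The mechanism described above can be illustrated with a toy sketch: a linear SAE whose decoder columns are feature directions, plus a hypothetical conditional-difference map `M` that turns prompt features into steering coefficients added to a token's activation during decoding. All names, sizes, and the random `M` here are illustrative assumptions, not the paper's actual implementation (which learns the map from preference triples).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy sizes (hypothetical)

# Toy SAE weights: the encoder maps activations to sparse features;
# decoder columns are feature directions in activation space.
W_enc = rng.standard_normal((d_sae, d_model)) * 0.1
W_dec = rng.standard_normal((d_model, d_sae)) * 0.1

def sae_features(h):
    # ReLU encoder: sparse, non-negative feature activations
    return np.maximum(W_enc @ h, 0.0)

# Hypothetical conditional-difference map M: prompt features ->
# steering coefficients over generation-control features.
# (Learned from preference triples in the paper; random here.)
M = rng.standard_normal((d_sae, d_sae)) * 0.01

def steer(h_token, prompt_feats, alpha=1.0):
    # Prompt-conditional coefficients over SAE features
    coeffs = M @ prompt_feats
    # Add the weighted feature directions to the token activation;
    # no model weights are modified, only this activation.
    return h_token + alpha * (W_dec @ coeffs)

prompt_h = rng.standard_normal(d_model)   # stand-in prompt activation
token_h = rng.standard_normal(d_model)    # stand-in token activation
steered = steer(token_h, sae_features(prompt_h))
print(steered.shape)  # (16,)
```

Because the intervention is a single additive edit to activations at decode time, it can be toggled per prompt, which is what makes the alignment modular rather than baked into the weights.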