First training-free method for debiasing reward models using Sparse Autoencoder (SAE) interventions.
arXiv · March 16, 2026 · 2603.12795
Why it matters
By isolating stylistic bias features in SAE latent space, SteerRM lets practitioners suppress a reward model's preference for better-presented but semantically inferior responses at inference time. Because the method needs no retraining, it offers an interpretable way to make RLHF alignment pipelines more reliable.
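As a rough illustration of what an inference-time SAE intervention can look like, the sketch below encodes an activation into SAE latents, zeroes a chosen set of latents, and decodes the result back into the model's hidden space. It shows a generic feature-ablation pattern, not SteerRM's actual procedure: the toy weights, the function names, and the index set `bias_features` are all assumptions made for this example, and how the paper selects or scales bias features is not covered here.

```python
# Illustrative sketch only: toy SAE weights and hypothetical feature indices,
# not the procedure or parameters from the paper.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def encode(h, w_enc, b_enc):
    """Map an activation h to non-negative SAE latents."""
    return relu([z + b for z, b in zip(matvec(w_enc, h), b_enc)])

def decode(z, w_dec, b_dec):
    """Reconstruct an activation from SAE latents."""
    return [r + b for r, b in zip(matvec(w_dec, z), b_dec)]

def suppress_features(h, w_enc, b_enc, w_dec, b_dec, bias_features):
    """Zero the selected latents and return the edited activation.

    The SAE reconstruction error is added back, so whatever the SAE
    fails to capture about h is left unchanged.
    """
    z = encode(h, w_enc, b_enc)
    error = [a - r for a, r in zip(h, decode(z, w_dec, b_dec))]
    z_edit = [0.0 if i in bias_features else v for i, v in enumerate(z)]
    return [r + e for r, e in zip(decode(z_edit, w_dec, b_dec), error)]

# Toy SAE: 2-dimensional activation, 3 latents (all values are made up).
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one row per latent
B_ENC = [0.0, 0.0, -1.0]
W_DEC = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]     # one row per hidden dimension
B_DEC = [0.0, 0.0]

h = [2.0, 1.0]
unchanged = suppress_features(h, W_ENC, B_ENC, W_DEC, B_DEC, bias_features=set())
steered = suppress_features(h, W_ENC, B_ENC, W_DEC, B_DEC, bias_features={2})
```

With an empty `bias_features` set the activation passes through untouched. With latent 2 suppressed, only that latent's decoder direction is subtracted from `h`. Adding the reconstruction error back is one common way to keep the edit from disturbing information the SAE does not model.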
From the abstract
Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses over semantically superior ones. Existing debiasing methods typically require retraining or architectural modifications, while direct activation suppression degrades performance due to representation entanglement. We propose SteerRM, the first training-free method for debiasing reward models using Sparse Autoencoder (SAE)-based inter