AI & ML · New Capability

First training-free method for debiasing reward models using Sparse Autoencoder (SAE) interventions.

arXiv · March 16, 2026 · 2603.12795

Mengyuan Sun, Zhuohao Yu, Weizheng Gu, Shikun Zhang, Wei Ye

Why it matters

By isolating stylistic bias features in SAE latent space, SteerRM lets practitioners suppress a reward model's preference for better-presented but semantically inferior responses at inference time. This offers an interpretable path, with no retraining, to more reliable RLHF alignment pipelines.
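To make the idea of an inference-time SAE intervention concrete, here is a minimal sketch of generic SAE latent ablation. It is not the paper's procedure, which this summary does not spell out. It assumes a standard ReLU SAE, and all names (suppress_features, style_ids, the random toy weights) are hypothetical. The edit subtracts only the decoder contribution of the chosen latents, so the part of the activation the SAE does not reconstruct is left untouched.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc, b_dec):
    """Latent activations of a standard ReLU SAE (assumed architecture)."""
    return np.maximum((h - b_dec) @ W_enc + b_enc, 0.0)

def suppress_features(h, W_enc, b_enc, W_dec, b_dec, feature_ids):
    """Zero the selected SAE latents and write the change back into h.

    Equivalent to decoding the edited latents and adding back the SAE's
    reconstruction error, so unrelated information in h is preserved.
    """
    ids = np.asarray(feature_ids, dtype=int)
    z = sae_encode(h, W_enc, b_enc, b_dec)
    delta = np.zeros_like(z)
    delta[:, ids] = z[:, ids]   # activations that will be removed
    return h - delta @ W_dec    # subtract their decoder contribution

# Toy example with random weights; a real setup would load a trained SAE
# and hook one layer of the reward model (both assumed, not shown here).
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

h = rng.normal(size=(4, d_model))   # hidden states for 4 tokens
style_ids = [3, 17]                 # hypothetical "style" latent indices
h_debiased = suppress_features(h, W_enc, b_enc, W_dec, b_dec, style_ids)
```

In practice the edited activation would replace the original in the reward model's forward pass, and the choice of which latents count as stylistic is the hard part that a method like this has to solve.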

From the abstract

Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses over semantically superior ones. Existing debiasing methods typically require retraining or architectural modifications, while direct activation suppression degrades performance due to representation entanglement. We propose SteerRM, the first training-free method for debiasing reward models using Sparse Autoencoder (SAE)-based inter