AI & ML New Capability

Uses Sparse Autoencoders (SAEs) to mechanistically repair 'moral indifference' in LLM latent representations.

arXiv · March 17, 2026 · 2603.15615

Lingyu Li, Yan Teng, Yingchun Wang

The Takeaway

Instead of behavioral fine-tuning, which changes only surface outputs, this method uses SAEs to identify and reconstruct internal moral topological features. The result is a more robust form of alignment that resists adversarial attacks and improves the granularity of moral reasoning.
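The paper's actual setup (251k moral vectors, real LLM activations) is not reproduced here, but the general SAE mechanism the takeaway refers to can be sketched compactly: an autoencoder with a ReLU bottleneck, trained to reconstruct activations under an L1 sparsity penalty so that individual hidden units come to represent distinct features. Everything below — dimensions, synthetic data, hyperparameters — is an invented toy illustration, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: activations are sparse mixtures of a few ground-truth
# directions (stand-ins for distinct concepts in a residual stream).
d_model, d_hidden, n_feats, n_samples = 16, 32, 8, 512
true_feats = rng.normal(size=(n_feats, d_model))
true_feats /= np.linalg.norm(true_feats, axis=1, keepdims=True)
coeffs = rng.random((n_samples, n_feats)) * (rng.random((n_samples, n_feats)) < 0.2)
X = coeffs @ true_feats                      # (n_samples, d_model)

# SAE parameters: h = relu(x W_e + b_e), x_hat = h W_d + b_d.
W_e = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_e = np.zeros(d_hidden)
W_d = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_d = np.zeros(d_model)
lam, lr = 1e-3, 0.05                         # L1 weight, learning rate

def forward(X):
    pre = X @ W_e + b_e
    H = np.maximum(pre, 0.0)                 # sparse hidden code
    return pre, H, H @ W_d + b_d

def loss(X, H, X_hat):
    # Reconstruction error (mean per sample) plus L1 sparsity penalty.
    return np.mean(np.sum((X_hat - X) ** 2, axis=1)) + lam * np.mean(np.abs(H))

_, H0, Xh0 = forward(X)
loss_before = loss(X, H0, Xh0)

for _ in range(2000):                        # plain gradient descent
    pre, H, X_hat = forward(X)
    B = X.shape[0]
    dXhat = 2.0 * (X_hat - X) / B            # d(loss)/d(x_hat)
    gW_d = H.T @ dXhat
    gb_d = dXhat.sum(axis=0)
    dH = dXhat @ W_d.T + lam * np.sign(H) / (B * d_hidden)
    dpre = dH * (pre > 0)                    # ReLU gradient mask
    gW_e = X.T @ dpre
    gb_e = dpre.sum(axis=0)
    W_e -= lr * gW_e; b_e -= lr * gb_e
    W_d -= lr * gW_d; b_d -= lr * gb_d

_, H1, Xh1 = forward(X)
loss_after = loss(X, H1, Xh1)
print(f"loss: {loss_before:.4f} -> {loss_after:.4f}")
```

In the paper's setting, the rows of the learned decoder `W_d` would play the role of interpretable feature directions in the model's latent space — the objects one could then inspect or intervene on, rather than surface outputs.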

From the abstract

Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference due to compressing distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs' latent representations, utilizing 251k moral vectors constructed upon P