Shows that RLHF and DPO alignment cause 'response homogenization,' which breaks standard sampling-based uncertainty estimation methods.
March 26, 2026
Original Paper
The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation
arXiv · 2603.24124
The Takeaway
Practitioners relying on semantic clustering of multiple samples to estimate model confidence (e.g., self-consistency) will find these methods fail on aligned models, because the model frequently collapses to a single semantic cluster across samples. This paper identifies a fundamental 'alignment tax' that researchers must account for when building reliable, uncertainty-aware LLM systems.
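To see why collapse is fatal for these methods, here is a minimal sketch (not the paper's exact procedure) of cluster-entropy-style uncertainty over sampled answers. Clustering is approximated with normalized exact match rather than the entailment-based merging such methods typically use; the function name and example strings are illustrative only.

```python
import math
from collections import Counter

def semantic_cluster_entropy(samples: list[str]) -> float:
    """Entropy over clusters of sampled responses.

    Clusters are approximated here by normalized exact match; a real
    implementation would merge paraphrases with an NLI/entailment model.
    """
    normalized = [s.strip().lower() for s in samples]
    counts = Counter(normalized)
    n = len(normalized)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# A homogenized model returns the same answer on every draw, so the score
# is 0 for every question, correct or not -- no discriminative power.
print(semantic_cluster_entropy(["Paris", "paris", "Paris"]))     # 0.0 -> looks "confident"
print(semantic_cluster_entropy(["Paris", "Lyon", "Marseille"]))  # > 0  -> looks "uncertain"
```

When all samples fall into one cluster, the entropy is identically zero, so the score cannot separate correct from incorrect answers, which is exactly the AUROC = 0.500 result reported below.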
From the abstract
RLHF-aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40-79% of questions produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampling-based uncertainty methods have zero discriminative power (AUROC=0.500), while free token entropy retains signal (AUROC=0.603). This alignment tax is task-dependent: on GSM8K (n=500), token entropy achieves AUROC=0.724 (Cohen's d=0.81). A base-vs-instruct ablation confirms the causal role of alignment: the base model […]
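For contrast with the sampling-based scores, a minimal sketch of a token-entropy baseline in the spirit of the one the abstract cites: mean per-token entropy of the next-token distributions over a generated answer. The (num_tokens, vocab_size) `logits` array and the mean aggregation are assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def mean_token_entropy(logits: np.ndarray) -> float:
    """Mean per-token entropy (nats) over a generated sequence.

    `logits` is assumed to be a (num_tokens, vocab_size) array of
    next-token logits recorded while decoding the answer.
    """
    # Numerically stable softmax at each position.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return float(entropy.mean())
```

Because this score is computed from a single generation's distributions rather than from agreement across samples, it can still vary across questions even when the sampled responses all collapse to one cluster.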