Shows that RLHF and DPO alignment cause 'response homogenization,' which breaks standard sampling-based uncertainty estimation methods.
March 26, 2026
Original Paper
The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation
arXiv · 2603.24124
The Takeaway
Practitioners relying on semantic clustering of multiple samples to estimate model confidence (e.g., self-consistency) will find these methods fail on aligned models, because the model frequently collapses to a single semantic cluster across samples. This paper identifies a fundamental 'alignment tax' that researchers must account for when building reliable, uncertainty-aware LLM systems.
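To see why collapse is fatal for these methods, here is a minimal sketch (not the paper's exact procedure) of cluster-entropy-style uncertainty over sampled answers. Clustering is approximated with normalized exact match rather than the entailment-based merging such methods typically use; the function name and example strings are illustrative only.

```python
import math
from collections import Counter

def semantic_cluster_entropy(samples: list[str]) -> float:
    """Entropy over clusters of sampled responses.

    Clusters are approximated here by normalized exact match; a real
    implementation would merge paraphrases with an NLI/entailment model.
    """
    normalized = [s.strip().lower() for s in samples]
    counts = Counter(normalized)
    n = len(normalized)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# A homogenized model returns the same answer on every draw, so the score
# is 0 for every question, correct or not -- no discriminative power.
print(semantic_cluster_entropy(["Paris", "paris", "Paris"]))     # 0.0 -> looks "confident"
print(semantic_cluster_entropy(["Paris", "Lyon", "Marseille"]))  # > 0  -> looks "uncertain"
```

When all samples fall into one cluster, the entropy is identically zero, so the score cannot separate correct from incorrect answers, which is exactly the AUROC = 0.500 result reported below.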
From the abstract
RLHF-aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40-79% of questions produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampling-based uncertainty methods have zero discriminative power (AUROC=0.500), while free token entropy retains signal (AUROC=0.603). This alignment tax is task-dependent: on GSM8K (n=500), token entropy achieves AUROC=0.724 (Cohen's d=0.81). A base-vs-instruct ablation confirms the causal role of alignment: the base model […]
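For contrast with the sampling-based scores, a minimal sketch of a token-entropy baseline in the spirit of the one the abstract cites: mean per-token entropy of the next-token distributions over a generated answer. The (num_tokens, vocab_size) `logits` array and the mean aggregation are assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def mean_token_entropy(logits: np.ndarray) -> float:
    """Mean per-token entropy (nats) over a generated sequence.

    `logits` is assumed to be a (num_tokens, vocab_size) array of
    next-token logits recorded while decoding the answer.
    """
    # Numerically stable softmax at each position.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return float(entropy.mean())
```

Because this score is computed from a single generation's distributions rather than from agreement across samples, it can still vary across questions even when the sampled responses all collapse to one cluster.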