Fixes the 'squeezing effect' in Direct Preference Optimization (DPO) with an efficient logit-space variant of Sharpness-Aware Minimization (SAM).
March 20, 2026
Original Paper
Sharpness-Aware Minimization in Logit Space Efficiently Enhances Direct Preference Optimization
arXiv · 2603.18258
The Takeaway
DPO often suffers from likelihood displacement, where the probability of preferred responses decreases during training. This paper traces the cause to curvature in logit space and provides a computationally 'free' fix that improves alignment stability across model scales.
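To make the idea concrete, here is a minimal sketch of what a logit-space SAM step for the DPO loss could look like. This is not the paper's implementation: the one-dimensional `margin` abstraction (the implicit-reward gap between chosen and rejected responses), the radius `rho`, and the analytic gradient are illustrative assumptions. The key move is that the adversarial ascent step is taken in logit space rather than in weight space, so no second forward/backward pass over the model parameters is needed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(margin, beta=0.1):
    # Standard DPO loss on the implicit-reward margin:
    #   margin = (logpi_chosen - logref_chosen) - (logpi_rejected - logref_rejected)
    return -np.log(sigmoid(beta * margin))

def sam_logit_dpo_loss(margin, beta=0.1, rho=0.05):
    # Logit-space SAM sketch (hypothetical): perturb the margin by an
    # ascent step of radius rho, then evaluate the DPO loss at the
    # worst-case perturbed point. Gradients of the model weights would
    # be taken through this perturbed loss.
    grad = -beta * sigmoid(-beta * margin)   # analytic d(loss)/d(margin)
    eps = rho * np.sign(grad)                # in 1-D, grad/||grad|| is its sign
    return dpo_loss(margin + eps, beta)
```

Because the perturbation is an ascent step, the SAM loss upper-bounds the plain loss at the same margin; minimizing it penalizes sharp directions in logit space without the extra gradient pass that weight-space SAM requires.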
From the abstract
Direct Preference Optimization (DPO) has emerged as a popular algorithm for aligning pretrained large language models with human preferences, owing to its simplicity and training stability. However, DPO suffers from the recently identified squeezing effect (also known as likelihood displacement), where the probability of preferred responses decreases unintentionally during training. To understand and mitigate this phenomenon, we develop a theoretical framework that models the coordinate-wise dyn