Towards Disentangled Preference Optimization Dynamics Beyond Likelihood Displacement

2026


Abstract

arXiv:2604.18239v2

Preference optimization is widely used to align large language models (LLMs) with human preferences. However, many margin-based objectives suppress the chosen response along with the rejected one, a phenomenon known as likelihood displacement, and no general mechanism currently prevents this across objectives. We bridge this gap by presenting a unified *incentive-score decomposition* of preference optimization, revealing that diverse objectives share identical local update directions and differ only in their scalar weighting coefficients. Building on this decomposition, by analyzing the dynamics of the chosen/rejected likelihoods, we identify the *disentanglement band* (DB), a simple, testable condition that characterizes when training can avoid likelihood displacement by realizing the preferred pathway: suppressing the loser while maintaining the winner, possibly after an initial transient. Leveraging the DB, we propose
