Reveal The Mystery Of DPO: The Connection Between DPO And RL Algorithms
2025 · Xuerui Su, Yue Wang, Jinhua Zhu, et al.
Abstract
With the rapid development of Large Language Models (LLMs), numerous Reinforcement Learning from Human Feedback (RLHF) algorithms have been introduced to improve model safety and alignment with human preferences. These algorithms can be divided into two main frameworks based on whether they require an explicit reward (or value) function for training: actor-critic-based Proximal Policy Optimization (PPO) and alignment-based Direct Preference Optimization (DPO). The mismatch between DPO and PPO, such as DPO's use of a classification loss driven by human-preferred data, has raised confusion about whether DPO should be classified as a Reinforcement Learning (RL) algorithm. To address these ambiguities, we focus on three key aspects related to DPO, RL, and other RLHF algorithms: (1) the construction of the loss function; (2) the target distribution at which the algorithm converges; (3) the impact of key components within the loss function. Specifically, we first establish a unified framewor
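For readers unfamiliar with the "classification loss" mentioned in the abstract, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not the unified framework this paper proposes. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be summed sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All four inputs are assumed to be sequence log-probabilities;
    beta is the hypothetical strength of the implicit KL constraint.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Binary classification loss -log(sigmoid(z)), written as softplus(-z)
    # in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy and reference agree, so the margin is zero and the loss is log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss depends only on log-probability ratios against the reference model, which is why no explicit reward or value network appears in the computation.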
Related papers
- Understanding The Performance Gap In Preference Learning: A Dichotomy Of RLHF And DPO (2025)
- Reward Model Learning Vs. Direct Policy Optimization: A Comparative Analysis Of Learning From Human Preferences (2024)
- A General Theoretical Paradigm To Understand Learning From Human Preferences (2023)
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback For LLM Alignment (2023)
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)
- DVPO: Distributional Value Modeling-based Policy Optimization For LLM Post-training (2026)
- Direct Multi-turn Preference Optimization For Language Agents (2024)
- What's Behind PPO's Collapse In Long-CoT? Value Optimization Holds The Secret (2025)