Direct Multi-turn Preference Optimization For Language Agents
2024 Β· Wentao Shi, Mengqi Yuan, Junkang Wu, et al.
Abstract
Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation with the alleviation of compounding errors, offering a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of the DMPO loss. The
Authors
(none)
Tags
Stats
Related papers
- Turn-ppo: Turn-level Advantage Estimation With PPO For Improved Multi-turn RL In Agentic Llms (2025)0.00
- End-to-end Optimization Of Llm-driven Multi-agent Search Systems Via Heterogeneous-group-based Reinforcement Learning (2025)0.00
- DVPO: Distributional Value Modeling-based Policy Optimization For LLM Post-training (2026)0.00
- Reveal The Mystery Of DPO: The Connection Between DPO And RL Algorithms (2025)0.00
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback For LLM Alignment (2023)0.00
- Stabilizing Off-policy Training For Long-horizon LLM Agent Via Turn-level Importance Sampling And Clipping-triggered Normalization (2025)0.00
- Stabilizing Reinforcement Learning For Diffusion Language Models (2026)0.00
- Reward Model Learning Vs. Direct Policy Optimization: A Comparative Analysis Of Learning From Human Preferences (2024)0.00