A Minimaximalist Approach To Reinforcement Learning From Human Feedback
2024 · Gokul Swamy, Christoph Dann, Rahul Kidambi, et al.
Abstract
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback. Our approach is minimalist in that it requires neither training a reward model nor unstable adversarial training, and is therefore rather simple to implement. Our approach is maximalist in that it provably handles non-Markovian, intransitive, and stochastic preferences while being robust to the compounding errors that plague offline approaches to sequential prediction. To achieve these qualities, we build upon the concept of a Minimax Winner (MW), a notion of preference aggregation from the social choice theory literature that frames learning from preferences as a zero-sum game between two policies. By leveraging the symmetry of this game, we prove that rather than using the traditional technique of dueling two policies to compute the MW, we can simply have a single agent play against itself while maintaining strong convergence guarantees. Practically, this correspond
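To make the Minimax Winner idea concrete, the following is a minimal tabular sketch, not the authors' SPO implementation. It assumes a small, fully known preference matrix over three options and swaps in a standard multiplicative-weights (Hedge) update as the no-regret learner; the function name `self_play_minimax_winner`, the step size, and the example matrix are all hypothetical choices made for illustration. A single policy is updated using its expected margin against itself, and the time-averaged policy approximates the MW of the symmetric zero-sum game.

```python
import math


def self_play_minimax_winner(pref, steps=5000, lr=0.05, init=None):
    """Approximate a Minimax Winner by running Hedge against itself.

    pref[i][j] is the (assumed known) probability that option i is
    preferred to option j, with pref[i][j] + pref[j][i] == 1.
    Returns the time-averaged mixed strategy over the options.
    """
    n = len(pref)
    # Antisymmetric payoff: positive when option i tends to beat option j.
    payoff = [[2.0 * pref[i][j] - 1.0 for j in range(n)] for i in range(n)]
    # Optional non-uniform starting policy (illustrative assumption).
    logits = [math.log(p) for p in init] if init else [0.0] * n
    avg = [0.0] * n
    for _ in range(steps):
        # Softmax of the logits gives the current mixed strategy.
        top = max(logits)
        weights = [math.exp(x - top) for x in logits]
        total = sum(weights)
        policy = [w / total for w in weights]
        avg = [a + p / steps for a, p in zip(avg, policy)]
        # Expected payoff of each option against the current policy itself:
        # the single agent plays both sides of the game.
        gains = [sum(payoff[i][j] * policy[j] for j in range(n)) for i in range(n)]
        logits = [x + lr * g for x, g in zip(logits, gains)]
    return avg


# Intransitive, stochastic preferences: 0 beats 1, 1 beats 2, 2 beats 0,
# each with probability 0.9. No single option is preferred to all others.
cyclic_pref = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]
mw = self_play_minimax_winner(cyclic_pref, init=[0.7, 0.2, 0.1])
print([round(p, 2) for p in mw])
```

For this cyclic example the averaged policy ends up close to uniform over the three options, which is the mixed strategy that no alternative is preferred to on average. In the sequential setting described in the abstract, the options would be policies and the preference probabilities would come from queried feedback rather than a known table.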
Related papers
- Efficient Competitive Self-play Policy Optimization (2020)
- Learning Zero-shot Cooperation With Humans, Assuming Humans Are Biased (2023)
- Reward Model Learning Vs. Direct Policy Optimization: A Comparative Analysis Of Learning From Human Preferences (2024)
- A Sharp Analysis Of Model-based Reinforcement Learning With Self-play (2020)
- Continuously Discovering Novel Strategies Via Reward-switching Policy Optimization (2022)
- Learning Self-imitating Diverse Policies (2018)
- Social Learning Spontaneously Emerges By Searching Optimal Heuristics With Deep Reinforcement Learning (2022)
- Accommodating Picky Customers: Regret Bound And Exploration Complexity For Multi-objective Reinforcement Learning (2020)