Multi-objective Reward And Preference Optimization: Theory And Algorithms
2025 · Akhil Agnihotri
Abstract
This thesis develops theoretical frameworks and algorithms that advance constrained reinforcement learning (RL) across control, preference learning, and alignment of large language models. The first contribution addresses constrained Markov Decision Processes (CMDPs) under the average-cost criterion through the Average-Constrained Policy Optimization (ACPO) algorithm. ACPO integrates sensitivity analysis with trust-region updates to ensure stable constraint handling, achieving state-of-the-art empirical performance with theoretical guarantees. Constrained RL is then extended to finite-horizon settings via e-COP, the first policy optimization method for episodic CMDPs. Built on an episodic policy difference lemma, e-COP offers provable performance, simplicity, and scalability in safety-critical environments. The thesis then investigates reinforcement learning from human preferences. warmPref-PS introduces a posterior sampling strategy for linear bandits that integrates offline preferenc
Related papers
- ACPO: A Policy Optimization Algorithm For Average MDPs With Constraints (2023)
- Unified Algorithms For RL With Decision-estimation Coefficients: PAC, Reward-free, Preference-based Learning, And Beyond (2022)
- A Generalized Algorithm For Multi-objective Reinforcement Learning And Policy Adaptation (2019)
- Reward Constrained Policy Optimization (2018)
- Accommodating Picky Customers: Regret Bound And Exploration Complexity For Multi-objective Reinforcement Learning (2020)
- Co2po: Coordinated Constrained Policy Optimization For Multi-agent RL (2026)
- Reward Model Learning Vs. Direct Policy Optimization: A Comparative Analysis Of Learning From Human Preferences (2024)
- Robust Model-free Reinforcement Learning With Multi-objective Bayesian Optimization (2019)