Accommodating Picky Customers: Regret Bound And Exploration Complexity For Multi-objective Reinforcement Learning
2020 Β· Jingfeng Wu, Vladimir Braverman, Lin F. Yang
Abstract
In this paper we consider multi-objective reinforcement learning where the objectives are balanced using preferences. In practice, the preferences are often given in an adversarial manner, e.g., customers can be picky in many applications. We formalize this problem as an episodic learning problem on a Markov decision process, where transitions are unknown and a reward function is the inner product of a preference vector with pre-specified multi-objective reward functions. We consider two settings. In the online setting, the agent receives a (adversarial) preference every episode and proposes policies to interact with the environment. We provide a model-based algorithm that achieves a nearly minimax optimal regret bound \(\widetilde\{\mathcal\{O\}\}\bigl(\sqrt\{\min\\{d,S\\}\cdot H^2 SAK\}\bigr)\), where \(d\) is the number of objectives, \(S\) is the number of states, \(A\) is the number of actions, \(H\) is the length of the horizon, and \(K\) is the number of episodes. Furthermore, w
Authors
(none)
Tags
Stats
Related papers
- Multi-objective Reward And Preference Optimization: Theory And Algorithms (2025)0.00
- A Generalized Algorithm For Multi-objective Reinforcement Learning And Policy Adaptation (2019)0.00
- Taming Equilibrium Bias In Risk-sensitive Multi-agent Reinforcement Learning (2024)0.00
- Tiered Reinforcement Learning: Pessimism In The Face Of Uncertainty And Constant Regret (2022)0.00
- Regret Minimization For Reinforcement Learning By Evaluating The Optimal Bias Function (2019)0.00
- Toward Negotiable Reinforcement Learning: Shifting Priorities In Pareto Optimal Sequential Decision-making (2017)0.00
- Sample-efficient Multi-objective Learning Via Generalized Policy Improvement Prioritization (2023)5.24
- Online Reinforcement Learning In Markov Decision Process Using Linear Programming (2023)3.58