Exploration-exploitation Trade-off In Reinforcement Learning On Online Markov Decision Processes With Global Concave Rewards
2019 Β· Wang Chi Cheung
Abstract
We consider an agent who is involved in a Markov decision process and receives a vector of outcomes every round. Her objective is to maximize a global concave reward function on the average vectorial outcome. The problem models applications such as multi-objective optimization, maximum entropy exploration, and constrained optimization in Markovian environments. In our general setting where a stationary policy could have multiple recurrent classes, the agent faces a subtle yet consequential trade-off in alternating among different actions for balancing the vectorial outcomes. In particular, stationary policies are in general sub-optimal. We propose a no-regret algorithm based on online convex optimization (OCO) tools (Agrawal and Devanur 2014) and UCRL2 (Jaksch et al. 2010). Importantly, we introduce a novel gradient threshold procedure, which carefully controls the switches among actions to handle the subtle trade-off. By delaying the gradient updates, our procedure produces a non-stat
Authors
(none)
Tags
Stats
Related papers
- Conservative Exploration In Reinforcement Learning (2020)0.00
- Exploration Conscious Reinforcement Learning Revisited (2018)0.00
- Online Reinforcement Learning In Markov Decision Process Using Linear Programming (2023)3.58
- Online Convex Optimization In Adversarial Markov Decision Processes (2019)0.00
- Conformal Off-policy Evaluation In Markov Decision Processes (2023)7.16
- Minimax-optimal Reward-agnostic Exploration In Reinforcement Learning (2023)0.00
- Learning Deterministic Policies With Policy Gradients In Constrained Markov Decision Processes (2025)0.00
- Safe Reinforcement Learning For Constrained Markov Decision Processes With Stochastic Stopping Time (2024)2.26