A Provably-efficient Model-free Algorithm For Constrained Markov Decision Processes
2021 Β· Honghao Wei, Xin Liu, Lei Ying
Abstract
This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The algorithm is named Triple-Q because it includes three key components: a Q-function (also called action-value function) for the cumulative reward, a Q-function for the cumulative utility for the constraint, and a virtual-Queue that (over)-estimates the cumulative constraint violation. Under Triple-Q, at each step, an action is chosen based on the pseudo-Q-value that is a combination of the three "Q" values. The algorithm updates the reward and utility Q-values with learning rates that depend on the visit counts to the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, Triple-Q achieves \(\tilde\{\cal O\}\left(\frac\{1 \}\{\delta\}H^4 S^\{\frac\{1\}\{2\}\}A^\{\frac\{1\}\{2\}\}K^\{\frac\{4\}\{5\}\} \right)\) regret, where \(K\) is the total number of ep
Authors
(none)
Tags
Stats
Related papers
- A Safe Exploration Approach To Constrained Markov Decision Processes (2023)0.00
- Provably Efficient Exploration In Constrained Reinforcement Learning:posterior Sampling Is All You Need (2023)0.00
- Model-free Reinforcement Learning In Infinite-horizon Average-reward Markov Decision Processes (2019)0.00
- A Model-free Learning Algorithm For Infinite-horizon Average-reward Mdps With Near-optimal Regret (2020)0.00
- Achieving Zero Constraint Violation For Constrained Reinforcement Learning Via Primal-dual Approach (2021)9.59
- Provably Efficient Primal-dual Reinforcement Learning For Cmdps With Non-stationary Objectives And Constraints (2022)0.00
- Learning General Parameterized Policies For Infinite Horizon Average Reward Constrained Mdps Via Primal-dual Policy Gradient Algorithm (2024)0.00
- Efficient Exploration In Average-reward Constrained Reinforcement Learning: Achieving Near-optimal Regret With Posterior Sampling (2024)0.00