Handling Cost And Constraints With Off-policy Deep Reinforcement Learning
2023 · Jared Markowitz, Jesse Silverberg, Gary Collins
Abstract
By reusing data throughout training, off-policy deep reinforcement learning algorithms offer improved sample efficiency relative to on-policy approaches. For continuous action spaces, the most popular methods for off-policy learning include policy improvement steps where a learned state-action (\(Q\)) value function is maximized over selected batches of data. These updates are often paired with regularization to combat associated overestimation of \(Q\) values. With an eye toward safety, we revisit this strategy in environments with "mixed-sign" reward functions; that is, with reward functions that include independent positive (incentive) and negative (cost) terms. This setting is common in real-world applications, and may be addressed with or without constraints on the cost terms. We find the combination of function approximation and a term that maximizes \(Q\) in the policy update to be problematic in such environments, because systematic errors in value estimation impact the contrib
Related papers
- Off-policy Deep Reinforcement Learning Without Exploration (2018)
- Mitigating Off-policy Bias In Actor-critic Methods With One-step Q-learning: A Novel Correction Approach (2022)
- Model-based Safe Deep Reinforcement Learning Via A Constrained Proximal Policy Optimization Algorithm (2022)
- Off-policy Policy Gradient Algorithms By Constraining The State Distribution Shift (2019)
- Conservative Exploration For Policy Optimization Via Off-policy Policy Evaluation (2023)
- Solving Richly Constrained Reinforcement Learning Through State Augmentation And Reward Penalties (2023)
- Efficient Off-policy Learning For High-dimensional Action Spaces (2024)
- Deep Inverse Q-learning With Constraints (2020)