Abstract

By reusing data throughout training, off-policy deep reinforcement learning algorithms offer improved sample efficiency relative to on-policy approaches. For continuous action spaces, the most popular methods for off-policy learning include policy improvement steps where a learned state-action (\(Q\)) value function is maximized over selected batches of data. These updates are often paired with regularization to combat associated overestimation of \(Q\) values. With an eye toward safety, we revisit this strategy in environments with "mixed-sign" reward functions; that is, with reward functions that include independent positive (incentive) and negative (cost) terms. This setting is common in real-world applications, and may be addressed with or without constraints on the cost terms. We find the combination of function approximation and a term that maximizes \(Q\) in the policy update to be problematic in such environments, because systematic errors in value estimation impact the contrib

Tags

  • Uncategorized

Related papers

Handling Cost And Constraints With Off-policy Deep Reinforcement Learning (reinforcement-learning)