CRPO: A New Approach For Safe Reinforcement Learning With Convergence Guarantee
2020 · Tengyu Xu, Yingbin Liang, Guanghui Lan
Abstract
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward while avoiding violation of constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and hence are very challenging to solve, particularly when a globally optimal policy is required. Many popular SRL algorithms adopt a primal-dual structure, which relies on updating dual variables to satisfy the constraints. In contrast, we propose a primal approach, called constraint-rectified policy optimization (CRPO), which alternates policy updates between objective improvement and constraint satisfaction. CRPO provides a primal-type algorithmic framework for solving SRL problems, in which each policy update can take any variant of policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy upd
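The alternation between objective improvement and constraint satisfaction described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the toy problem is a single-state bandit with a softmax policy, the costs are computed exactly rather than estimated, the switch between the two update types is a simple threshold test with a hypothetical tolerance `tol`, and each update is a plain policy-gradient step rather than the NPG step named in the abstract. The function names (`crpo_sketch`, `softmax_grad`, and so on) are likewise hypothetical.

```python
import math


def softmax(theta):
    """Convert logits into action probabilities (numerically stable)."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected(values, probs):
    """Expected value of per-action `values` under the policy `probs`."""
    return sum(v * p for v, p in zip(values, probs))


def softmax_grad(values, probs):
    """Gradient of E_pi[values] w.r.t. the logits: p_a * (v_a - E_pi[v])."""
    mean = expected(values, probs)
    return [p * (v - mean) for v, p in zip(values, probs)]


def crpo_sketch(rewards, costs, limit, tol=0.01, lr=0.1, steps=2000):
    """Alternate between reward ascent and cost descent (illustrative only).

    Assumption: the constraint counts as satisfied when the expected cost
    is at most `limit + tol`; otherwise the step reduces the cost instead.
    """
    theta = [0.0] * len(rewards)  # uniform initial policy
    feasible_policies = []        # iterates that received a reward step
    for _ in range(steps):
        probs = softmax(theta)
        if expected(costs, probs) <= limit + tol:
            # Constraint holds within tolerance: improve the objective.
            feasible_policies.append(probs)
            grad = softmax_grad(rewards, probs)
        else:
            # Constraint violated: take a step that lowers the cost.
            grad = [-g for g in softmax_grad(costs, probs)]
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta), feasible_policies


if __name__ == "__main__":
    # Two actions: the first is rewarding but costly, the second is safe.
    policy, feasible = crpo_sketch([1.0, 0.2], [1.0, 0.0], limit=0.3)
    print("final policy:", policy)
    print("expected cost:", expected([1.0, 0.0], policy))
    print("feasible iterates:", len(feasible))
```

No dual variable appears anywhere in the loop: the only state is the policy parameter vector, which is the sense in which the sketch is primal. In this toy setting the expected cost ends up hovering near the threshold, because reward steps push it up and cost steps pull it back down.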
Related papers
- Last-iterate Global Convergence Of Policy Gradients For Constrained Reinforcement Learning (2024)
- Reward Constrained Policy Optimization (2018)
- Model-based Safe Deep Reinforcement Learning Via A Constrained Proximal Policy Optimization Algorithm (2022)
- Embedding Safety Into RL: A New Take On Trust Region Methods (2024)
- Learning Deterministic Policies With Policy Gradients In Constrained Markov Decision Processes (2025)
- Constraint-conditioned Policy Optimization For Versatile Safe Reinforcement Learning (2023)
- DOPE: Doubly Optimistic And Pessimistic Exploration For Safe Reinforcement Learning (2021)
- Towards Safe Reinforcement Learning Via Constraining Conditional Value-at-risk (2022)