Embedding Safety Into RL: A New Take On Trust Region Methods
2024 · Nikola Milosevic, Johannes Müller, Nico Scherf
Abstract
Reinforcement Learning (RL) agents can solve diverse tasks but often exhibit unsafe behavior. Constrained Markov Decision Processes (CMDPs) address this by enforcing safety constraints, yet existing methods either sacrifice reward maximization or allow unsafe training. We introduce Constrained Trust Region Policy Optimization (C-TRPO), which reshapes the policy space geometry to ensure trust regions contain only safe policies, guaranteeing constraint satisfaction throughout training. We analyze its theoretical properties and connections to TRPO, Natural Policy Gradient (NPG), and Constrained Policy Optimization (CPO). Experiments show that C-TRPO reduces constraint violations while maintaining competitive returns.
Related papers
- Adaptive Trust Region Policy Optimization: Global Convergence And Faster Rates For Regularized MDPs (2019)
- EnTRPO: Trust Region Policy Optimization Method With Entropy Regularization (2021)
- Trust-PCL: An Off-policy Trust Region Method For Continuous Control (2017)
- Trust Region Policy Optimisation In Multi-agent Reinforcement Learning (2021)
- Simple Policy Optimization (2024)
- Multi-agent Trust Region Policy Optimization (2020)
- CRPO: A New Approach For Safe Reinforcement Learning With Convergence Guarantee (2020)
- Hindsight Trust Region Policy Optimization (2019)