Constrained Policy Improvement For Safe And Efficient Reinforcement Learning
2018 · Elad Sarafian, Aviv Tamar, Sarit Kraus
Abstract
We propose a policy improvement algorithm for Reinforcement Learning (RL) called Rerouted Behavior Improvement (RBI). RBI is designed to take into account the evaluation errors of the Q-function. Such errors are common in RL when the Q-value is learned from finite past experience data. Greedy policies, and even constrained policy optimization algorithms that ignore these errors, may suffer from an improvement penalty (i.e. a negative policy improvement). To minimize the improvement penalty, RBI attenuates rapid policy changes for low-probability actions, which were sampled less frequently. This approach is shown to avoid catastrophic performance degradation and reduce regret when learning from a batch of past experience. Using a two-armed bandit example with Gaussian-distributed rewards, we show that it also increases data efficiency when the optimal action has a high variance. We evaluate RBI in two tasks in the Atari Learning Environment: (1) learning from obse
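The idea of attenuating changes to rarely sampled actions can be illustrated with a small sketch. The abstract does not state the exact constraint, so the code below assumes a multiplicative bound of the form c_min * beta(a) <= pi(a) <= c_max * beta(a), where beta is the behavior policy and pi the improved policy. The function name `reroute_improve` and the parameters `c_min` and `c_max` are illustrative choices, not taken from the text above.

```python
import numpy as np

def reroute_improve(beta, q, c_min=0.5, c_max=1.5):
    """Greedy improvement under assumed per-action multiplicative bounds.

    Maximizes sum_a pi(a) * q(a) subject to
    c_min * beta(a) <= pi(a) <= c_max * beta(a) and sum_a pi(a) = 1.
    Requires c_min <= 1 <= c_max so that a feasible policy exists.
    """
    beta = np.asarray(beta, dtype=float)
    q = np.asarray(q, dtype=float)
    pi = c_min * beta              # every action keeps at least its lower bound
    budget = 1.0 - pi.sum()        # probability mass still to be assigned
    for a in np.argsort(-q):       # visit actions from highest to lowest estimate
        add = min(budget, (c_max - c_min) * beta[a])  # cap at the upper bound
        pi[a] += add
        budget -= add
        if budget <= 0.0:
            break
    return pi

# Behavior policy rarely tried action 1, but its estimated value is higher.
beta = [0.9, 0.1]
q_hat = [0.0, 1.0]
pi = reroute_improve(beta, q_hat)
print(pi)
```

With these inputs a greedy policy would jump to (0, 1), trusting an estimate built from few samples. Under the assumed bounds the policy moves only from (0.9, 0.1) to about (0.85, 0.15), so the probability of the rarely sampled action changes slowly.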
Related papers
- Blending Imitation And Reinforcement Learning For Robust Policy Improvement (2023)
- Policy Improvement Reinforcement Learning (2026)
- Never Worse, Mostly Better: Stable Policy Improvement In Deep Reinforcement Learning (2019)
- Theoretically Guaranteed Policy Improvement Distilled From Model-based Planning (2023)
- Feasible Policy Iteration For Safe Reinforcement Learning (2023)
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)
- Policy Bifurcation In Safe Reinforcement Learning (2024)
- Conservative Exploration For Policy Optimization Via Off-policy Policy Evaluation (2023)