Conservative Dual Policy Optimization For Efficient Model-based Reinforcement Learning
2022 · Shenao Zhang
Abstract
Provably efficient Model-Based Reinforcement Learning (MBRL) based on optimism or posterior sampling (PSRL) is guaranteed to attain global optimality asymptotically by introducing a complexity measure of the model. However, this complexity might grow exponentially even for the simplest nonlinear models, in which case global convergence is impossible within finitely many iterations. When the model suffers a large generalization error, which is quantitatively measured by the model complexity, the uncertainty can be large. The sampled model upon which the current policy is greedily optimized will thus be unsettled, resulting in aggressive policy updates and over-exploration. In this work, we propose Conservative Dual Policy Optimization (CDPO), which involves a Referential Update and a Conservative Update. The policy is first optimized under a reference model, which imitates the mechanism of PSRL while offering more stability. A conservative range of randomness is guaranteed by maximizing the expectation of model
Related papers
- Conservative Optimistic Policy Optimization Via Multiple Importance Sampling (2021)
- Deep Model-based Reinforcement Learning Via Estimated Uncertainty And Conservative Policy Optimization (2019)
- Conservative Exploration For Policy Optimization Via Off-policy Policy Evaluation (2023)
- How To Fine-tune The Model: Unified Model Shift And Model Bias Policy Optimization (2023)
- When To Update Your Model: Constrained Model-based Reinforcement Learning (2022)
- Double Horizon Model-based Policy Optimization (2025)
- Towards Causal Model-based Policy Optimization (2025)
- Efficient Model-based Reinforcement Learning Through Optimistic Policy Search And Planning (2020)