Dual Policy Iteration
2018 · Wen Sun, Geoffrey J. Gordon, Byron Boots, et al.
Abstract
Recently, a novel class of Approximate Policy Iteration (API) algorithms has demonstrated impressive practical performance (e.g., ExIt from [2], AlphaGo-Zero from [27]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., tree search) that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL app
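The alternation between the two policies can be illustrated with a minimal tabular sketch. It is not the paper's algorithm: the five-state chain MDP, the known dynamics in `step`, the fixed-depth lookahead standing in for the non-reactive policy, and the mixing rate `alpha` are all assumptions chosen for illustration (the paper's instance uses learned local models and optimal control instead). The reactive policy `pi` supplies leaf values to the lookahead, and is then nudged toward the lookahead's action choices `eta`.

```python
# Illustrative sketch only: a tabular stand-in for the two-policy loop.
# The chain MDP, known dynamics, and all names below are assumptions.
GAMMA = 0.9
N_STATES, ACTIONS = 5, (0, 1)  # action 0 = left, 1 = right


def step(s, a):
    """Deterministic chain dynamics; reward 1 for landing in the last state."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)


def evaluate(pi, sweeps=200):
    """Iterative policy evaluation of the reactive policy pi."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [sum(pi[s][a] * (r + GAMMA * v[s2])
                 for a in ACTIONS for s2, r in [step(s, a)])
             for s in range(N_STATES)]
    return v


def q_value(s, a, depth, leaf_v):
    """Value of taking a in s, then searching depth - 1 further steps."""
    s2, r = step(s, a)
    return r + GAMMA * lookahead(s2, depth - 1, leaf_v)


def lookahead(s, depth, leaf_v):
    """Exhaustive search; leaves are scored by the reactive policy's values."""
    if depth == 0:
        return leaf_v[s]
    return max(q_value(s, a, depth, leaf_v) for a in ACTIONS)


def dual_policy_iteration(iterations=10, depth=3, alpha=0.5):
    pi = [[0.5, 0.5] for _ in range(N_STATES)]  # reactive policy, uniform start
    for _ in range(iterations):
        # Non-reactive policy: multi-step search guided by the reactive policy.
        v = evaluate(pi)
        eta = [max(ACTIONS, key=lambda a: q_value(s, a, depth, v))
               for s in range(N_STATES)]
        # Reactive policy: supervised step toward the search policy's actions.
        for s in range(N_STATES):
            for a in ACTIONS:
                target = 1.0 if a == eta[s] else 0.0
                pi[s][a] = (1 - alpha) * pi[s][a] + alpha * target
    return pi
```

Calling `dual_policy_iteration()` returns one probability row per state. In this toy chain the search always prefers moving right, so the reactive policy's mass shifts toward action 1 with each round.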
Related papers
- Adaptive Approximate Policy Iteration (2020) 0.00
- Iterative Amortized Policy Optimization (2020) 0.00
- Addressing Action Oscillations Through Learning Policy Inertia (2021) 7.81
- Multi-step Greedy Reinforcement Learning Algorithms (2019) 0.00
- Modified Actor-critics (2019) 0.00
- Easy Monotonic Policy Iteration (2016) 0.00
- Greedification Operators For Policy Optimization: Investigating Forward And Reverse KL Divergences (2021) 0.00
- Blending Imitation And Reinforcement Learning For Robust Policy Improvement (2023) 0.00