Distillation Policy Optimization
2023 · Jianfei Ma
Abstract
On-policy algorithms are known for their stability but often demand a substantial number of samples. Off-policy algorithms, which reuse past experience, are more sample-efficient but tend to be unstable. Can we develop an algorithm that harnesses the benefits of off-policy data while maintaining stable learning? In this paper, we introduce an actor-critic learning framework that harmonizes the two data sources, on-policy and off-policy, for both evaluation and control, enabling rapid learning and flexible integration with on-policy algorithms. The framework incorporates variance reduction mechanisms, including a unified advantage estimator (UAE) and a residual baseline, that improve the efficacy of both on- and off-policy learning. Our empirical results show substantial improvements in sample efficiency for on-policy algorithms, effectively bridging the gap to off-policy approaches and demonstrating the promise of our approach as a novel learning paradigm.
Related papers
- Sample-efficient Model-free Reinforcement Learning With Off-policy Critics (2019)
- Discriminator-actor-critic: Addressing Sample Inefficiency And Reward Bias In Adversarial Imitation Learning (2018)
- Doubly Robust Off-policy Actor-critic Algorithms For Reinforcement Learning (2019)
- Mitigating Off-policy Bias In Actor-critic Methods With One-step Q-learning: A Novel Correction Approach (2022)
- Stable And Efficient Policy Evaluation (2020)
- Neural Network Compatible Off-policy Natural Actor-critic Algorithm (2021)
- An Approximate Policy Iteration Viewpoint Of Actor-critic Algorithms (2022)
- How To Learn A Useful Critic? Model-based Action-gradient-estimator Policy Optimization (2020)