Policy Improvement Via Imitation Of Multiple Oracles
2020 · Ching-An Cheng, Andrey Kolobov, Alekh Agarwal
Abstract
Despite its promise, reinforcement learning's real-world adoption has been hampered by the need for costly exploration to learn a good policy. Imitation learning (IL) mitigates this shortcoming by using an oracle policy during training as a bootstrap to accelerate the learning process. However, in many practical situations, the learner has access to multiple suboptimal oracles, which may provide conflicting advice in a state. The existing IL literature provides a limited treatment of such scenarios. Whereas in the single-oracle case, the return of the oracle's policy provides an obvious benchmark for the learner to compete against, neither such a benchmark nor principled ways of outperforming it are known for the multi-oracle setting. In this paper, we propose the state-wise maximum of the oracle policies' values as a natural baseline to resolve conflicting advice from multiple oracles. Using a reduction of policy optimization to online learning, we introduce a novel IL algorithm MAMBA
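The state-wise maximum baseline mentioned in the abstract can be illustrated with a minimal tabular sketch. This is not the paper's MAMBA algorithm; it only shows the baseline itself, assuming each oracle's value is already known for every state of a small finite state space. The function names (`statewise_max_baseline`, `best_oracle_per_state`) and the numbers are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def statewise_max_baseline(oracle_values: Sequence[Sequence[float]]) -> List[float]:
    """Return the per-state maximum over oracle values.

    oracle_values[k][s] is assumed to hold the value of oracle k in state s.
    """
    if not oracle_values:
        raise ValueError("need at least one oracle")
    n_states = len(oracle_values[0])
    if any(len(values) != n_states for values in oracle_values):
        raise ValueError("all oracles must cover the same states")
    return [max(values[s] for values in oracle_values) for s in range(n_states)]


def best_oracle_per_state(oracle_values: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each state, the index of an oracle attaining the maximum.

    Ties are broken in favor of the lowest oracle index.
    """
    n_states = len(oracle_values[0])
    n_oracles = len(oracle_values)
    return [
        max(range(n_oracles), key=lambda k: oracle_values[k][s])
        for s in range(n_states)
    ]


# Illustrative values only: two oracles, three states.
example_values = [
    [1.0, 0.2, 0.5],  # oracle 0
    [0.4, 0.9, 0.5],  # oracle 1
]
baseline = statewise_max_baseline(example_values)
advisors = best_oracle_per_state(example_values)
```

In this toy example neither oracle is best everywhere: oracle 0 is better in the first state and oracle 1 in the second, so the state-wise maximum is at least as high as either oracle's value in every state.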
Related papers
- Active Policy Improvement From Multiple Black-box Oracles (2023)
- Blending Imitation And Reinforcement Learning For Robust Policy Improvement (2023)
- Accelerating Imitation Learning With Predictive Models (2018)
- Sample-efficient Multi-objective Learning Via Generalized Policy Improvement Prioritization (2023)
- Some Supervision Required: Incorporating Oracle Policies In Reinforcement Learning Via Epistemic Uncertainty Metrics (2022)
- Model-based Reinforcement Learning With Double Oracle Efficiency In Policy Optimization And Offline Estimation (2026)
- Explaining Fast Improvement In Online Imitation Learning (2020)
- Learning Self-imitating Diverse Policies (2018)