Imitating Past Successes Can Be Very Suboptimal
2022 · Benjamin Eysenbach, Soumith Udatha, Sergey Levine, et al.
Abstract
Prior work has proposed a simple strategy for reinforcement learning (RL): label experience with the outcomes achieved in that experience, and then imitate the relabeled experience. These outcome-conditioned imitation learning methods are appealing because of their simplicity, strong performance, and close ties with supervised learning. However, it remains unclear how these methods relate to the standard RL objective, reward maximization. In this paper, we formally relate outcome-conditioned imitation learning to reward maximization, drawing a precise relationship between the learned policy and Q-values and explaining the close connections between these methods and prior EM-based policy search methods. This analysis shows that existing outcome-conditioned imitation learning methods do not necessarily improve the policy, but a simple modification results in a method that does guarantee policy improvement, under some assumptions.
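The two-step recipe described in the abstract (relabel experience with the outcome it achieved, then imitate the relabeled data) can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the names `relabel` and `fit_policy`, the toy trajectories, and the choice to treat the final state of a trajectory as its "outcome" are all assumptions made for illustration. The imitation step here is a simple count-based maximum-likelihood estimate of the action distribution given state and outcome.

```python
from collections import Counter, defaultdict


def relabel(trajectories):
    """Attach to each (state, action) pair the outcome its trajectory reached.

    Assumption: the outcome is the final state of the trajectory.
    Each trajectory is (states, actions) with len(states) == len(actions) + 1.
    """
    data = []
    for states, actions in trajectories:
        outcome = states[-1]
        for state, action in zip(states[:-1], actions):
            data.append((state, outcome, action))
    return data


def fit_policy(data):
    """Count-based conditional imitation: estimate pi(action | state, outcome)."""
    counts = defaultdict(Counter)
    for state, outcome, action in data:
        counts[(state, outcome)][action] += 1
    policy = {}
    for key, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[key] = {a: n / total for a, n in action_counts.items()}
    return policy


# Hypothetical toy data on a 3-state chain (states 0, 1, 2).
trajectories = [
    ([0, 1, 2], ["right", "right"]),
    ([0, 1, 0], ["right", "left"]),
    ([0, 1, 2], ["right", "right"]),
]

relabeled = relabel(trajectories)
policy = fit_policy(relabeled)

# Action distribution in state 1 when conditioning on outcome 2.
print(policy[(1, 2)])
```

Note that the fitted policy only reproduces what the data-collecting behavior happened to do when a given outcome occurred. As the abstract states, this kind of imitation does not by itself guarantee policy improvement.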
Related papers
- Reward-conditioned Policies (2019)
- Learning Self-imitating Diverse Policies (2018)
- Seizing Serendipity: Exploiting The Value Of Past Success In Off-policy Actor-critic (2023)
- Replacing Rewards With Examples: Example-based Policy Search Via Recursive Classification (2021)
- Rewriting History With Inverse RL: Hindsight Inference For Policy Improvement (2020)
- Success Conditioning As Policy Improvement: The Optimization Problem Solved By Imitating Success (2026)
- Value Enhancement Of Reinforcement Learning Via Efficient And Robust Trust Region Optimization (2023)
- Imitating Opponent To Win: Adversarial Policy Imitation Learning In Two-player Competitive Games (2022)