Learning Self-imitating Diverse Policies
2018 · Tanmay Gangwani, Qiang Liu, Jian Peng
Abstract
The success of popular algorithms for deep reinforcement learning, such as policy gradients and Q-learning, relies heavily on the availability of an informative reward signal at each timestep of the sequential decision-making process. When rewards are only sparsely available during an episode, or rewarding feedback is provided only after episode termination, these algorithms perform sub-optimally due to the difficulty of credit assignment. Alternatively, trajectory-based policy optimization methods, such as the cross-entropy method and evolution strategies, do not require per-timestep rewards, but have been found to suffer from high sample complexity because they completely forgo the temporal nature of the problem. Improving the efficiency of RL algorithms in real-world problems with sparse or episodic rewards is therefore a pressing need. In this work, we introduce a self-imitation learning algorithm that exploits and explores well in the sparse and episodic reward settings. We view each poli
Related papers
- Reward-conditioned Policies (2019)
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)
- Near-future Policy Optimization (2026)
- Memory Based Trajectory-conditioned Policies For Learning From Sparse Rewards (2019)
- Model-free Policy Learning With Reward Gradients (2021)
- Intrinsic Reward Policy Optimization For Sparse-reward Environments (2026)
- Evolution-guided Policy Gradient In Reinforcement Learning (2018)
- Replacing Rewards With Examples: Example-based Policy Search Via Recursive Classification (2021)