Reward-conditioned Policies
2019 · Aviral Kumar, Xue Bin Peng, Sergey Levine
Abstract
Reinforcement learning offers the promise of automating the acquisition of complex behavioral skills. However, compared to commonly used and well-understood supervised learning methods, reinforcement learning algorithms can be brittle, difficult to use and tune, and sensitive to seemingly innocuous implementation decisions. In contrast, imitation learning utilizes standard and well-understood supervised learning methods, but requires near-optimal expert data. Can we learn effective policies via supervised learning without demonstrations? The main idea that we explore in this work is that non-expert trajectories collected from sub-optimal policies can be viewed as optimal supervision, not for maximizing the reward, but for matching the reward of the given trajectory. By then conditioning the policy on the numerical value of the reward, we can obtain a policy that generalizes to larger returns. We show how such an approach can be derived as a principled method for policy search, discuss
Related papers
- Learning Self-imitating Diverse Policies (2018)
- Replacing Rewards With Examples: Example-based Policy Search Via Recursive Classification (2021)
- Learning Safe Policies With Expert Guidance (2018)
- Post-convergence Sim-to-real Policy Transfer: A Principled Alternative To Cherry-picking (2025)
- Imitating Past Successes Can Be Very Suboptimal (2022)
- Policy Gradient From Demonstration And Curiosity (2020)
- Learning To Reach Goals Via Iterated Supervised Learning (2019)
- Think Outside The Policy: In-context Steered Policy Optimization (2025)