Model-free Policy Learning With Reward Gradients
2021 · Qingfeng Lan, Samuele Tosatto, Homayoon Farrahi, et al.
Abstract
Despite the increasing popularity of policy gradient methods, they are yet to be widely utilized in sample-scarce applications, such as robotics. Sample efficiency could be improved by making the best use of available information. As a key component in reinforcement learning, the reward function is usually devised carefully to guide the agent. Hence, the reward function is usually known, allowing access to not only scalar reward signals but also reward gradients. To benefit from reward gradients, previous works require knowledge of the environment dynamics, which are hard to obtain. In this work, we develop the Reward Policy Gradient estimator, a novel approach that integrates reward gradients without learning a model. Bypassing the model dynamics allows our estimator to achieve a better bias-variance trade-off, which results in higher sample efficiency, as shown in the empirical analysis. Our method also boosts the performance of Proximal Policy Optimization on differen
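To illustrate what "access to reward gradients" can mean in practice, here is a minimal Python sketch. The reward function, its coefficients, and the names `reward`, `reward_grad_action`, and `finite_difference` are all hypothetical and not taken from the paper. The sketch only shows that a hand-designed reward can be differentiated with respect to the action; it does not implement the Reward Policy Gradient estimator.

```python
def reward(state, action):
    """Hypothetical hand-designed reward: track the state, with a small action cost."""
    return -(action - state) ** 2 - 0.1 * action ** 2


def reward_grad_action(state, action):
    """Analytic derivative of the hypothetical reward with respect to the action."""
    return -2.0 * (action - state) - 0.2 * action


def finite_difference(f, state, action, eps=1e-6):
    """Central-difference check of the derivative with respect to the action."""
    return (f(state, action + eps) - f(state, action - eps)) / (2.0 * eps)


state, action = 0.25, 0.5
analytic = reward_grad_action(state, action)
numeric = finite_difference(reward, state, action)
print(f"analytic gradient: {analytic:.6f}, finite-difference check: {numeric:.6f}")
```

Because the reward is written down by the designer, its derivative is available in closed form (or through automatic differentiation) without any knowledge of how the environment transitions between states.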
Related papers
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016) · 0.00
- An Empirical Analysis Of Measure-valued Derivatives For Policy Gradients (2021) · 0.00
- An Analysis Of Measure-valued Derivatives For Policy Gradients (2022) · 2.26
- On The Model-based Stochastic Value Gradient For Continuous Reinforcement Learning (2020) · 0.00
- Batch Reinforcement Learning With A Nonparametric Off-policy Policy Gradient (2020) · 0.00
- Stabilizing Policy Gradient Methods Via Reward Profiling (2025) · 0.00
- Learning Self-imitating Diverse Policies (2018) · 0.00
- Smoothing Policies And Safe Policy Gradients (2019) · 7.50