Variational Policy Gradient Method For Reinforcement Learning With General Utilities
2020 · Junyu Zhang, Alec Koppel, Amrit Singh Bedi, et al.
Abstract
In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. As this means that dynamic programming no longer works, we focus on direct policy search. Analogously to the Policy Gradient Theorem \cite{sutton2000policy} available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Car
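To make the central object of the abstract concrete, the following minimal sketch computes the discounted state-action occupancy measure of a fixed policy in a small tabular MDP and evaluates two utilities on it. It is an illustration under stated assumptions, not the paper's algorithm: the random MDP, the function names (`occupancy_measure`, `linear_utility`, `entropy_utility`), and the choice of occupancy entropy as an exploration-style concave utility are all hypothetical. The linear utility recovers the usual cumulative-reward objective, which is the special case where the Bellman equation still applies.

```python
import numpy as np

def occupancy_measure(P, pi, xi, gamma):
    """Discounted occupancy lam[s, a] = sum_t gamma^t * Pr(s_t = s, a_t = a).

    P:  (S, A, S) transition tensor, P[s, a, s'] = Pr(s' | s, a)
    pi: (S, A) policy, pi[s, a] = Pr(a | s)
    xi: (S,) initial state distribution
    """
    S = P.shape[0]
    # State-to-state transition matrix induced by the policy.
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Discounted state visitation d solves d = xi + gamma * P_pi^T d.
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, xi)
    return d[:, None] * pi

def linear_utility(lam, r):
    """Cumulative-reward special case: F(lam) = <lam, r>."""
    return float(np.sum(lam * r))

def entropy_utility(lam):
    """Illustrative concave utility (an assumption): entropy of the
    normalized occupancy measure."""
    p = lam / lam.sum()
    return float(-np.sum(p * np.log(p + 1e-12)))

# Hypothetical random MDP used only for illustration.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)
xi = np.full(S, 1.0 / S)
r = rng.random((S, A))

lam = occupancy_measure(P, pi, xi, gamma)
print("linear utility :", linear_utility(lam, r))
print("entropy utility:", entropy_utility(lam))
```

The linear utility can be checked against the value function obtained from the Bellman equation. The entropy utility has no additive per-step reward decomposition, which is why direct policy search is the natural tool for it.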
Related papers
- Policy Gradient For Reinforcement Learning With General Utilities (2022)
- On The Global Optimality Of Policy Gradient Methods In General Utility Reinforcement Learning (2024)
- Policy Gradient Using Weak Derivatives For Reinforcement Learning (2020)
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)
- Why Policy Gradient Algorithms Work For Undiscounted Total-reward MDPs (2025)
- Model-free Policy Learning With Reward Gradients (2021)
- Policy Optimization Over General State And Action Spaces (2022)
- \(f\)-policy Gradients: A General Framework For Goal Conditioned RL Using \(f\)-divergences (2023)