Score-aware Policy-gradient And Performance Guarantees Using Local Lyapunov Stability
2023 · Céline Comte, Matthieu Jonckheere, Jaron Sanders, et al.
Abstract
In this paper, we introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a type of stationary distribution commonly obtained from Markov decision processes (MDPs) in stochastic networks, queueing systems, and statistical mechanics. Specifically, when the stationary distribution of the MDP belongs to an exponential family that is parametrized by policy parameters, we can improve existing policy-gradient methods for average-reward RL. Our key contribution is the identification of a family of gradient estimators, called score-aware gradient estimators (SAGEs), that enable policy-gradient estimation without relying on value-function estimation in the aforementioned setting. We show that SAGE-based policy gradient converges locally, and we obtain its regret. This includes cases where the state space of the MDP is countable and unstable policies can exist. Under appropriate assumptions such as starting sufficiently close to a maximizer and the existence of a local Lyapunov
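To illustrate why an exponential-family stationary distribution allows gradient estimation without value functions, here is a minimal sketch. It is not the paper's estimator. It rests on a standard exponential-family identity: for a distribution with natural parameter η and sufficient statistic T, the derivative of E[r(X)] with respect to η equals Cov(T(X), r(X)). The toy model is an assumption made for illustration: a geometric distribution p(x) = (1 - ρ)ρ^x with η = log ρ and T(x) = x, a reward r(x) = -x that depends only on the state, and i.i.d. samples instead of a trajectory. The names `sample_geometric` and `score_aware_gradient` are hypothetical.

```python
import math
import random

def sample_geometric(rho, rng):
    """Draw X on {0, 1, 2, ...} with P(X = x) = (1 - rho) * rho**x."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(rho))

def score_aware_gradient(samples, reward):
    """Estimate d/d(eta) E[r(X)] as the sample covariance Cov(T(X), r(X)).

    Assumes the sufficient statistic is T(x) = x and eta is the natural
    parameter (illustrative assumptions, not taken from the paper).
    """
    n = len(samples)
    rewards = [reward(x) for x in samples]
    mean_t = sum(samples) / n
    mean_r = sum(rewards) / n
    return sum((x - mean_t) * (r - mean_r) for x, r in zip(samples, rewards)) / n

rho = 0.5
rng = random.Random(0)  # fixed seed for reproducibility
samples = [sample_geometric(rho, rng) for _ in range(100_000)]

# Reward is the negative state (e.g. a holding cost in a queue).
estimate = score_aware_gradient(samples, reward=lambda x: -x)

# Closed form: E[r] = -rho / (1 - rho), so d/d(eta) = rho * d/d(rho) gives:
exact = -rho / (1.0 - rho) ** 2
print(f"estimate = {estimate:.3f}, exact = {exact:.3f}")
```

The estimate uses only samples of the state and the reward; no value function or critic appears. With policy parameters θ entering through η(θ), the chain rule would multiply this covariance by the Jacobian of η.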
Related papers
- Stabilizing Policy Gradient Methods Via Reward Profiling (2025)
- Smoothing Policies And Safe Policy Gradients (2019)
- Learning Optimal Deterministic Policies With Stochastic Policy Gradients (2024)
- Mixed Policy Gradient: Off-policy Reinforcement Learning Driven Jointly By Data And Model (2021)
- Stabilizing Policy Gradients For Sample-efficient Reinforcement Learning In LLM Reasoning (2025)
- Model-free Policy Learning With Reward Gradients (2021)
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)
- Strongly-polynomial Time And Validation Analysis Of Policy Gradient Methods (2024)