Policy Gradient For Continuing Tasks In Non-stationary Markov Decision Processes
2020 · Santiago Paternain, Juan Andres Bazerque, Alejandro Ribeiro
Abstract
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities. In this paper we consider the problem of finding optimal policies assuming that they belong to a reproducing kernel Hilbert space (RKHS). To that end, we compute unbiased stochastic gradients of the value function, which we use as ascent directions to update the policy. A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed. This prevents these algorithms from being fully implemented online, which is a desirable property for systems that need to adapt to new tasks and/or environments in deployment. The main requirement for a policy gradient algorithm to work is that the estimate of the gradient at any point in time is an ascent direction for the initial value function. In this work we establish that this is indeed the case, which enab
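To make the idea of a policy living in an RKHS and being updated along stochastic gradient directions concrete, here is a minimal sketch. It is not the authors' algorithm. It assumes a scalar-action Gaussian policy whose mean is a kernel expansion, a Gaussian kernel, and a return estimate `q_estimate` supplied by the caller; the paper's unbiased estimator and its treatment of continuing tasks are not reproduced. All names (`KernelGaussianPolicy`, `ascent_step`, `bandwidth`) are hypothetical.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two state vectors (assumed kernel choice)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2)))


class KernelGaussianPolicy:
    """Gaussian policy a ~ N(h(s), sigma^2) with mean h(s) = sum_i w_i k(s_i, s).

    Hypothetical illustration: h is an element of the RKHS induced by the kernel.
    """

    def __init__(self, sigma=0.5, bandwidth=1.0):
        self.sigma = sigma
        self.bandwidth = bandwidth
        self.centers = []  # states s_i visited so far
        self.weights = []  # coefficients w_i of the kernel expansion

    def mean(self, state):
        # Evaluate h(state); an empty expansion gives the zero function.
        return float(sum(
            w * gaussian_kernel(c, state, self.bandwidth)
            for c, w in zip(self.centers, self.weights)
        ))

    def sample(self, state, rng):
        # Draw an action from N(h(state), sigma^2).
        return self.mean(state) + self.sigma * rng.standard_normal()

    def ascent_step(self, state, action, q_estimate, step_size):
        # For a Gaussian policy, the functional gradient of log pi(a|s) with
        # respect to h is ((a - h(s)) / sigma^2) * k(s, .). Scaling it by a
        # return estimate and a step size adds one new kernel element to h.
        weight = step_size * q_estimate * (action - self.mean(state)) / self.sigma ** 2
        self.centers.append(np.asarray(state, dtype=float))
        self.weights.append(weight)
        return weight
```

Each update appends one kernel center, so the expansion grows with the number of steps. A practical implementation would need some form of pruning or sparsification, which this sketch omits.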
Related papers
- The Reinforce Policy Gradient Algorithm Revisited (2023)
- Learning Optimal Deterministic Policies With Stochastic Policy Gradients (2024)
- A Policy Gradient Approach For Finite Horizon Constrained Markov Decision Processes (2022)
- Deterministic Policy Gradient For Reinforcement Learning With Continuous Time And State (2025)
- Policy Gradient Using Weak Derivatives For Reinforcement Learning (2020)
- On The Linear Convergence Of Natural Policy Gradient Algorithm (2021)
- Policy Gradient Algorithms With Monte Carlo Tree Learning For Non-Markov Decision Processes (2022)
- Why Policy Gradient Algorithms Work For Undiscounted Total-reward MDPs (2025)