Finite-time Analysis Of Natural Actor-critic For POMDPs
2022 · Semih Cayci, Niao He, R. Srikant
Abstract
We consider the reinforcement learning problem for partially observed Markov decision processes (POMDPs) with large or even countably infinite state spaces, where the controller has access to only noisy observations of the underlying controlled Markov chain. We consider a natural actor-critic method that employs a finite internal memory for policy parameterization, and a multi-step temporal difference learning algorithm for policy evaluation. We establish, to the best of our knowledge, the first non-asymptotic global convergence of actor-critic methods for partially observed systems under function approximation. In particular, in addition to the function approximation and statistical errors that also arise in MDPs, we explicitly characterize the error due to the use of finite-state controllers. This additional error is stated in terms of the total variation distance between the traditional belief state in POMDPs and the posterior distribution of the hidden state when using a finite-sta
Related papers
- Finite-time Analysis Of Single-timescale Actor-critic (2022)
- Recurrent Natural Policy Gradient For POMDPs (2024)
- Finite-sample Analysis Of Off-policy Natural Actor-critic With Linear Function Approximation (2021)
- Convergence Proof For Actor-critic Methods Applied To PPO And RUDDER (2020)
- Provably Efficient Reinforcement Learning In Partially Observable Dynamical Systems (2022)
- A Finite Time Analysis Of Two Time-scale Actor Critic Methods (2020)
- Convergence Of Finite Memory Q-learning For POMDPs And Near Optimality Of Learned Policies Under Filter Stability (2021)
- Finite Sample Analysis Of Two-time-scale Natural Actor-critic Algorithm (2021)