Recurrent Natural Policy Gradient for POMDPs
2024 · Semih Cayci, Atilla Eryilmaz
Abstract
Solving partially observable Markov decision processes (POMDPs) remains a fundamental challenge in reinforcement learning (RL), primarily due to the curse of dimensionality induced by the non-stationarity of optimal policies. In this work, we study a natural actor-critic (NAC) algorithm that integrates recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method and a temporal difference (TD) learning method. This framework leverages the representational capacity of RNNs to address non-stationarity in RL to solve POMDPs while retaining the statistical and computational efficiency of natural gradient methods in RL. We provide non-asymptotic theoretical guarantees for this method, including bounds on sample and iteration complexity to achieve global optimality up to function approximation. Additionally, we characterize pathological cases that stem from long-term dependencies, thereby explaining limitations of RNN-based policy optimization for POMDPs.
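As a rough illustration of the ingredients named in the abstract, the sketch below pairs a recurrent policy (an Elman-style RNN that summarizes the observation history) with a natural policy gradient step computed by regularized least squares on the score function. It is a minimal sketch under stated assumptions, not the authors' algorithm: only the linear readout `C` is updated while the recurrent weights `W` and `U` stay fixed, the advantage estimates are supplied by the caller rather than learned by a TD critic, and all names (`rnn_features`, `policy`, `score`, `npg_step`, `eta`, `ridge`) are hypothetical.

```python
import numpy as np

def rnn_features(obs_seq, W, U):
    """Elman recurrence h_t = tanh(W h_{t-1} + U x_t); returns the last hidden state."""
    h = np.zeros(W.shape[0])
    for x in obs_seq:
        h = np.tanh(W @ h + U @ x)
    return h

def policy(h, C):
    """Softmax action distribution over the linear readout logits C @ h."""
    z = C @ h
    z = z - z.max()  # shift logits for numerical stability
    p = np.exp(z)
    return p / p.sum()

def score(h, C, a):
    """Gradient of log pi(a | h) with respect to C, flattened to a vector."""
    p = policy(h, C)
    g = -np.outer(p, h)  # row b is -p_b * h
    g[a] += h            # row a becomes (1 - p_a) * h
    return g.ravel()

def npg_step(C, hs, actions, advantages, eta=0.1, ridge=1e-3):
    """One natural gradient step on the readout C (assumption: W, U are frozen).

    Solves w = argmin_w sum_i (score_i . w - A_i)^2 + ridge * |w|^2,
    then returns C + eta * w.
    """
    G = np.stack([score(h, C, a) for h, a in zip(hs, actions)])
    A = np.asarray(advantages, dtype=float)
    w = np.linalg.solve(G.T @ G + ridge * np.eye(G.shape[1]), G.T @ A)
    return C + eta * w.reshape(C.shape)

# Usage: two histories that end in the same observation generally map to
# different hidden states, so the policy can depend on the history.
rng = np.random.default_rng(0)
W = 0.5 * rng.standard_normal((4, 4))
U = rng.standard_normal((4, 3))
C = rng.standard_normal((2, 4))
obs = np.eye(3)  # one-hot encodings of three observations
h_a = rnn_features([obs[0], obs[2]], W, U)
h_b = rnn_features([obs[1], obs[2]], W, U)
print(policy(h_a, C), policy(h_b, C))

# A positive advantage for action 1 at history a raises its probability there.
C_new = npg_step(C, [h_a], [1], [1.0])
print(policy(h_a, C_new))
```

Solving the least-squares problem directly avoids forming and inverting the Fisher information matrix explicitly; the `ridge` term is an assumption added here to keep the linear system well conditioned on small batches.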
Related papers
- Finite-Time Analysis of Natural Actor-Critic for POMDPs (2022)
- Scaling Internal-State Policy-Gradient Methods for POMDPs (2025)
- Finite-State Controllers for (Hidden-Model) POMDPs Using Deep Reinforcement Learning (2026)
- Federated Natural Policy Gradient and Actor Critic Methods for Multi-Task Reinforcement Learning (2023)
- On the Linear Convergence of Natural Policy Gradient Algorithm (2021)
- Optimistic Natural Policy Gradient: A Simple Efficient Policy Optimization Framework for Online RL (2023)
- Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning (2022)
- A Policy Gradient Method for Confounded POMDPs (2023)