A Policy Gradient Method For Confounded POMDPs
2023 · Mao Hong, Zhengling Qi, Yanxun Xu
Abstract
In this paper, we propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting. We first establish a novel identification result to non-parametrically estimate any history-dependent policy gradient under POMDPs using the offline data. The identification enables us to solve a sequence of conditional moment restrictions and adopt the min-max learning procedure with general function approximation for estimating the policy gradient. We then provide a finite-sample non-asymptotic bound for estimating the gradient uniformly over a pre-specified policy class in terms of the sample size, length of horizon, concentratability coefficient and the measure of ill-posedness in solving the conditional moment restrictions. Lastly, by deploying the proposed gradient estimation in the gradient ascent algorithm, we show the global convergence of the proposed algorithm in finding the history-depe
Related papers
- A Study Of Policy Gradient On A Class Of Exactly Solvable Models (2020)
- Scaling Internal-state Policy-gradient Methods For POMDPs (2025)
- Sequential Monte Carlo For Policy Optimization In Continuous POMDPs (2025)
- A Minimax Learning Approach To Off-policy Evaluation In Confounded Partially Observable Markov Decision Processes (2021)
- Reinforcement Learning With Continuous Actions Under Unmeasured Confounding (2025)
- Policy Gradient In Partially Observable Environments: Approximation And Convergence (2018)
- Recurrent Natural Policy Gradient For POMDPs (2024)
- Statistical Tractability Of Off-policy Evaluation Of History-dependent Policies In POMDPs (2025)