Narrowing The Gap Between Adversarial And Stochastic Mdps Via Policy Optimization
2024 Β· Daniil Tiapkin, Evgenii Chzhen, Gilles Stoltz
Abstract
We consider the problem of learning in adversarial Markov decision processes [MDPs] with an oblivious adversary in a full-information setting. The agent interacts with an environment during \(T\) episodes, each of which consists of \(H\) stages, and each episode is evaluated with respect to a reward function that will be revealed only at the end of the episode. We propose an algorithm, called APO-MVP, that achieves a regret bound of order \(\tilde\{\mathcal\{O\}\}(\mathrm\{poly\}(H)\sqrt\{SAT\})\), where \(S\) and \(A\) are sizes of the state and action spaces, respectively. This result improves upon the best-known regret bound by a factor of \(\sqrt\{S\}\), bridging the gap between adversarial and stochastic MDPs, and matching the minimax lower bound \(Ξ©(\sqrt\{H^3SAT\})\) as far as the dependencies in \(S,A,T\) are concerned. The proposed algorithm and analysis completely avoid the typical tool given by occupancy measures; instead, it performs policy optimization based only on dynami
Authors
(none)
Tags
Stats
Related papers
- Near-optimal Policy Optimization Algorithms For Learning Adversarial Linear Mixture Mdps (2021)0.00
- Efficient Policy Learning For Non-stationary Mdps Under Adversarial Manipulation (2019)0.00
- Near-optimal Regret Using Policy Optimization In Online Mdps With Aggregate Bandit Feedback (2025)0.00
- A Theoretical Analysis Of Optimistic Proximal Policy Optimization In Linear Markov Decision Processes (2023)0.00
- Online Convex Optimization In Adversarial Markov Decision Processes (2019)0.00
- Refined Regret For Adversarial Mdps With Linear Function Approximation (2023)0.00
- Dynamic Regret Of Online Markov Decision Processes (2022)0.00
- Online Reinforcement Learning In Markov Decision Process Using Linear Programming (2023)3.58