Policy Optimization Via Adv2: Adversarial Learning On Advantage Functions
2023 · Matthieu Jonckheere, Chiara Mignacco, Gilles Stoltz
Abstract
We revisit the reduction of learning in adversarial Markov decision processes (MDPs) to adversarial learning based on \(Q\)-values; this reduction has been considered in a number of recent articles as one building block to perform policy optimization. Namely, we first consider and extend this reduction in an ideal setting where an oracle provides value functions: it may involve any adversarial learning strategy (not just exponential weights) and it may be based indifferently on \(Q\)-values or on advantage functions. We then present two extensions: on the one hand, convergence of the last iterate for a vast class of adversarial learning strategies (again, not just exponential weights), satisfying a property called monotonicity of weights; on the other hand, stronger regret criteria for learning in MDPs, inherited from the stronger regret criteria of adversarial learning called strongly adaptive regret and tracking regret. Third, we demonstrate how adversarial learning, also referred
Related papers
- Near-optimal Policy Optimization Algorithms For Learning Adversarial Linear Mixture MDPs (2021)
- Narrowing The Gap Between Adversarial And Stochastic MDPs Via Policy Optimization (2024)
- Minimax Weight And Q-function Learning For Off-policy Evaluation (2019)
- Refined Regret For Adversarial MDPs With Linear Function Approximation (2023)
- \(\sqrt{n}\)-regret For Learning In Markov Decision Processes With Function Approximation And Low Bellman Rank (2019)
- Learning In Markov Games With Adaptive Adversaries: Policy Regret, Fundamental Barriers, And Efficient Algorithms (2024)
- AM-PPO: (advantage) Alpha-modulation With Proximal Policy Optimization (2025)
- Delay-adapted Policy Optimization And Improved Regret For Adversarial MDP With Delayed Bandit Feedback (2023)