Near-optimal Regret Using Policy Optimization In Online Mdps With Aggregate Bandit Feedback
2025 Β· Tal Lancewicki, Yishay Mansour
Abstract
We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a full-bandit). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting. In the known-dynamics case, we achieve the first \textit\{optimal\} regret bound of \(\tilde \Theta(H^2\sqrt\{SAK\})\), where \(K\) is the number of episodes, \(H\) is the episode horizon, \(S\) is the number of states, and \(A\) is the number of actions. In the unknown dynamics case we establish regret bound of \(\tilde O(H^3 S \sqrt\{AK\})\), significantly improving the best known result by a factor of \(H^2 S^5 A^2\).
Authors
(none)
Tags
Stats
Related papers
- Towards Optimal Regret In Adversarial Linear Mdps With Bandit Feedback (2023)0.00
- Online Markov Decision Processes With Aggregate Bandit Feedback (2021)0.00
- Rate-optimal Policy Optimization For Linear Markov Decision Processes (2023)0.00
- Near-optimal Regret For Adversarial MDP With Delayed Bandit Feedback (2022)0.00
- Delay-adapted Policy Optimization And Improved Regret For Adversarial MDP With Delayed Bandit Feedback (2023)0.00
- Dynamic Regret Of Online Markov Decision Processes (2022)0.00
- Narrowing The Gap Between Adversarial And Stochastic Mdps Via Policy Optimization (2024)0.00
- Learning Adversarial Markov Decision Processes With Delayed Feedback (2020)0.00