Near-optimal Regret For Adversarial MDP With Delayed Bandit Feedback
2022 Β· Tiancheng Jin, Tal Lancewicki, Haipeng Luo, et al.
Abstract
The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode \(k\) is revealed only in the end of episode \(k + d^k\), where the delay \(d^k\) can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal \(\sqrt\{K + D\}\) regret, where \(K\) is the number of episodes and \(D = \sum_\{k=1\}^K d^k\) is the total delay, significantly improving upon the best known regret bound of \((K + D)^\{2/3\}\).
Authors
(none)
Tags
Stats
Related papers
- Learning Adversarial Markov Decision Processes With Delayed Feedback (2020)0.00
- Towards Optimal Regret In Adversarial Linear Mdps With Bandit Feedback (2023)0.00
- Online Markov Decision Processes With Aggregate Bandit Feedback (2021)0.00
- Near-optimal Regret Using Policy Optimization In Online Mdps With Aggregate Bandit Feedback (2025)0.00
- Delay-adapted Policy Optimization And Improved Regret For Adversarial MDP With Delayed Bandit Feedback (2023)0.00
- Dynamic Regret Of Online Markov Decision Processes (2022)0.00
- Improved Algorithm For Adversarial Linear Mixture Mdps With Bandit Feedback And Unknown Transition (2024)0.00
- Optimism And Delays In Episodic Reinforcement Learning (2021)0.00