Optimism And Delays In Episodic Reinforcement Learning
2021 · Benjamin Howson, Ciara Pike-Burke, Sarah Filippi
Abstract
There are many algorithms for regret minimisation in episodic reinforcement learning. This problem is well understood from a theoretical perspective, provided that the sequences of states, actions and rewards associated with each episode are available to the algorithm updating the policy immediately after every interaction with the environment. However, feedback is almost always delayed in practice. In this paper, we study the impact of delayed feedback in episodic reinforcement learning from a theoretical perspective and propose two general-purpose approaches to handling the delays. The first involves updating as soon as new information becomes available, whereas the second waits before using newly observed information to update the policy. For the class of optimistic algorithms and either approach, we show that the regret increases by an additive term involving the number of states and actions, the episode length, the expected delay and an algorithm-dependent constant. We empirically inves
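The contrast between the two approaches described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' algorithms: the uniform delay distribution, the fixed `batch_size` waiting rule, and the function names `simulate_arrivals` and `count_updates` are all hypothetical choices made for illustration. It counts how often a policy would be updated when feedback is used on arrival versus when the learner waits for a batch of delayed feedback to accumulate.

```python
import random


def simulate_arrivals(num_episodes, mean_delay, seed=0):
    """Map each arrival time to the episodes whose feedback arrives then.

    Assumption: delays are uniform on {0, ..., 2 * mean_delay}, so their
    expectation is mean_delay. The paper does not prescribe this choice.
    """
    rng = random.Random(seed)
    arrivals = {}
    for k in range(num_episodes):
        delay = rng.randint(0, 2 * mean_delay)
        arrivals.setdefault(k + delay, []).append(k)
    return arrivals


def count_updates(num_episodes, arrivals, batch_size):
    """Count policy updates and the feedback items consumed by them.

    batch_size=1 mimics "update as soon as new information is available";
    batch_size>1 is a hypothetical waiting rule that defers the update
    until enough delayed feedback has accumulated.
    """
    pending = 0   # feedback observed but not yet used in an update
    used = 0      # feedback already incorporated into the policy
    updates = 0   # number of policy updates performed
    for k in range(num_episodes):
        pending += len(arrivals.get(k, []))
        if pending >= batch_size:
            used += pending
            pending = 0
            updates += 1
    return updates, used


if __name__ == "__main__":
    arrivals = simulate_arrivals(num_episodes=200, mean_delay=5)
    print("update on arrival:", count_updates(200, arrivals, batch_size=1))
    print("wait for a batch: ", count_updates(200, arrivals, batch_size=20))
```

Under this toy model the waiting rule performs no more updates than the eager rule, at the cost of acting for longer on stale information. Any feedback scheduled to arrive after the final episode is never used by either rule.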
Related papers
- Near-optimal Regret For Adversarial MDP With Delayed Bandit Feedback (2022)
- Learning Adversarial Markov Decision Processes With Delayed Feedback (2020)
- Logarithmic Regret Of Exploration In Average Reward Markov Decision Processes (2025)
- Beyond Value-function Gaps: Improved Instance-dependent Regret Bounds For Episodic Reinforcement Learning (2021)
- Online Reinforcement Learning In Markov Decision Process Using Linear Programming (2023)
- The Best Of Both Worlds: Reinforcement Learning With Logarithmic Regret And Policy Switches (2022)
- Provably Efficient Reinforcement Learning With Aggregated States (2019)
- Delay-adapted Policy Optimization And Improved Regret For Adversarial MDP With Delayed Bandit Feedback (2023)