Reinforcement Learning With Delayed, Composite, And Partially Anonymous Reward
2023 · Washim Uddin Mondal, Vaneet Aggarwal
Abstract
We investigate an infinite-horizon average reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. The delay and compositeness of rewards mean that rewards generated as a result of taking an action at a given state are fragmented into different components, and they are sequentially realized at delayed time instances. The partial anonymity attribute implies that a learner, for each state, only observes the aggregate of past reward components generated as a result of different actions taken at that state, but realized at the observation instance. We propose an algorithm named \(\mathrm{DUCRL2}\) to obtain a near-optimal policy for this setting and show that it achieves a regret bound of \(\tilde{\mathcal{O}}\left(DS\sqrt{AT} + d (SA)^3\right)\) where \(S\) and \(A\) are the sizes of the state and action spaces, respectively, \(D\) is the diameter of the MDP, \(d\) is a parameter upper bounded by the maximum reward delay, and \(T\) de
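The feedback model described in the abstract can be illustrated with a minimal sketch. It is not the DUCRL2 algorithm, only a toy simulation of what the learner observes. The function names `observed_feedback` and `split`, the fixed per-step delays, and the numeric reward values are assumptions made for illustration; the abstract does not specify them.

```python
def observed_feedback(trajectory, reward_components, num_states):
    """Simulate delayed, composite, partially anonymous reward feedback.

    trajectory: list of (state, action) pairs, one per time step.
    reward_components: hypothetical function (state, action) -> list of
        floats, where entry k is the component realized k steps later.
    Returns obs, where obs[t][s] is the sum of all components realized at
    time t that were generated at state s, regardless of the action taken.
    """
    horizon = len(trajectory)
    obs = [[0.0] * num_states for _ in range(horizon)]
    for t, (state, action) in enumerate(trajectory):
        for delay, component in enumerate(reward_components(state, action)):
            if t + delay < horizon:
                # The learner sees only the per-state aggregate, so the
                # action that produced each component is not recoverable.
                obs[t + delay][state] += component
    return obs


def split(state, action):
    # Assumed toy reward split: action 0 pays mostly up front,
    # action 1 pays everything one step later.
    return [0.5, 0.25] if action == 0 else [0.0, 1.0]


trajectory = [(0, 0), (0, 1), (1, 0)]
obs = observed_feedback(trajectory, split, num_states=2)
print(obs)  # [[0.5, 0.0], [0.25, 0.0], [1.0, 0.5]]
```

In this toy run, the value 1.0 observed for state 0 at time 2 is attributable to an earlier visit to that state, but the observation alone does not reveal which action generated it.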
Related papers
- Learning Adversarial Markov Decision Processes With Delayed Feedback (2020)
- Near-optimal Regret For Adversarial MDP With Delayed Bandit Feedback (2022)
- Revisiting State Augmentation Methods For Reinforcement Learning With Stochastic Delays (2021)
- RUDDER: Return Decomposition For Delayed Rewards (2018)
- Sharper Model-free Reinforcement Learning For Average-reward Markov Decision Processes (2023)
- Reinforcement Learning For Infinite-horizon Average-reward Linear MDPs Via Approximation By Discounted-reward MDPs (2024)
- Logarithmic Regret Bounds For Continuous-time Average-reward Markov Decision Processes (2022)
- A Sharper Global Convergence Analysis For Average Reward Reinforcement Learning Via An Actor-critic Approach (2024)