Posterior Sampling For Reinforcement Learning: Worst-case Regret Bounds
2017 · Shipra Agrawal, Randy Jia
Abstract
We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of \(\tilde{O}(DS\sqrt{AT})\) for any communicating MDP with \(S\) states, \(A\) actions and diameter \(D\). Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, in time horizon \(T\). This result closely matches the known lower bound of \(\Omega(\sqrt{DSAT})\). Our techniques involve proving some novel results about the anti-concentration of Dirichlet distribution, which may be of independent interest.
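The sampling step behind posterior sampling in a tabular MDP can be illustrated with a minimal sketch. It is not the paper's algorithm, whose details the abstract does not give. It only shows the generic idea of drawing one plausible transition model from a Dirichlet posterior over next states. The function name `sample_transition_model`, the `counts` array layout, and the symmetric prior value are all assumptions made for illustration.

```python
import numpy as np

def sample_transition_model(counts, prior=1.0, rng=None):
    """Draw one transition model from a Dirichlet posterior (illustrative only).

    counts[s, a, s2] is the number of observed transitions s --a--> s2.
    prior is a hypothetical symmetric Dirichlet prior parameter.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_states, num_actions, _ = counts.shape
    model = np.empty((num_states, num_actions, num_states))
    for s in range(num_states):
        for a in range(num_actions):
            # Posterior over next-state probabilities is Dirichlet(counts + prior).
            model[s, a] = rng.dirichlet(counts[s, a] + prior)
    return model

# Toy example: 3 states, 2 actions, with a handful of observed transitions.
counts = np.zeros((3, 2, 3))
counts[0, 1, 2] = 5
P = sample_transition_model(counts, rng=np.random.default_rng(0))
```

Each `P[s, a]` is a probability vector over next states. A posterior sampling agent would plan against a sampled model such as `P` and then update `counts` with the transitions it observes.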
Related papers
- Efficient Exploration In Average-reward Constrained Reinforcement Learning: Achieving Near-optimal Regret With Posterior Sampling (2024)
- Provably Efficient Exploration In Constrained Reinforcement Learning: Posterior Sampling Is All You Need (2023)
- Optimistic Posterior Sampling For Reinforcement Learning With Few Samples And Tight Guarantees (2022)
- Minimax Regret Bounds For Reinforcement Learning (2017)
- Variance-aware Regret Bounds For Undiscounted Reinforcement Learning In MDPs (2018)
- Regret Analysis In Deterministic Reinforcement Learning (2021)
- Why Is Posterior Sampling Better Than Optimism For Reinforcement Learning? (2016)
- A Provably Efficient Model-free Posterior Sampling Method For Episodic Reinforcement Learning (2022)