Thompson Sampling For Infinite-horizon Discounted Decision Processes
2024 Β· Daniel Adelman, Cagla Keceli, Alba V. Olivares-Nadal
Abstract
This paper develops a viable notion of learning for sampling-based algorithms that applies in broader settings than previously considered. More specifically, we model a discounted infinite-horizon MDPs with Borel state and action spaces, whose rewards and transitions depend on an unknown parameter. To analyze adaptive learning algorithms based on sampling we introduce a general canonical probability space in this setting. Since standard definitions of regret are inadequate for policy evaluation in this setting, we propose new metrics that arise from decomposing the standard expected regret in discounted infinite-horizon MDPs into three terms: (i) the expected finite-time regret, (ii) the expected state regret, and (iii) the expected residual regret. Component (i) translates into the traditional concept of expected regret over a finite horizon. Term (ii) reflects how much future performance is compromised at a given time because earlier decisions have led the system to a less favorable
Authors
(none)
Tags
Stats
Related papers
- Posterior Sampling For Reinforcement Learning: Worst-case Regret Bounds (2017)0.00
- Provably Efficient Exploration In Constrained Reinforcement Learning:posterior Sampling Is All You Need (2023)0.00
- Efficient Exploration In Average-reward Constrained Reinforcement Learning: Achieving Near-optimal Regret With Posterior Sampling (2024)0.00
- Bayesian Learning Of Optimal Policies In Markov Decision Processes With Countably Infinite State-space (2023)0.00
- A Provably Efficient Model-free Posterior Sampling Method For Episodic Reinforcement Learning (2022)0.00
- Reinforcement Learning For Infinite-horizon Average-reward Linear Mdps Via Approximation By Discounted-reward Mdps (2024)0.00
- Langevin Thompson Sampling With Logarithmic Communication: Bandits And Reinforcement Learning (2023)0.00
- Model-free Reinforcement Learning: From Clipped Pseudo-regret To Sample Complexity (2020)0.00