Langevin Thompson Sampling With Logarithmic Communication: Bandits And Reinforcement Learning
2023 · Amin Karbasi, Nikki Lijing Kuang, Yi-An Ma, et al.
Abstract
Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance. However, many existing analytical and empirical results for TS rely on restrictive assumptions about the reward distributions, such as membership in conjugate families, which limits their applicability in realistic scenarios. Moreover, sequential decision-making problems are often carried out in batches, either because of the inherent nature of the problem or to reduce communication and computation costs. In this work, we jointly study these problems in two popular settings, namely stochastic multi-armed bandits (MABs) and infinite-horizon reinforcement learning (RL), where TS is used to learn the unknown reward distributions and transition dynamics, respectively. We propose batched Langevin Thompson Sampling algorithms that use MCMC methods to sample from approximate posteriors while incurring only logarithmic communication costs.
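To make the two ingredients concrete, here is a minimal Python sketch of batched Thompson sampling driven by Langevin MCMC on a K-armed bandit. It rests on assumptions that are ours, not the paper's. Rewards are Gaussian with unit variance and each arm mean has a N(0, prior_var) prior. This model is chosen only so the gradient is easy to check, because Langevin dynamics needs nothing more than the gradient of the log-posterior, which is what makes it usable for non-conjugate rewards. The MCMC method is the unadjusted Langevin algorithm. Batching uses a doubling rule: after each synchronization the learner commits to one arm until that arm's pull count doubles. The doubling rule is one standard way to get logarithmically many synchronizations, and the paper's own schedule may differ. The names `langevin_sample`, `make_grad`, and `batched_langevin_ts` are hypothetical.

```python
import math
import random


def langevin_sample(grad_log_post, theta, step, n_steps, rng):
    """Run the unadjusted Langevin algorithm (ULA) starting from `theta`.

    Each step is theta <- theta + step * grad_log_post(theta) + sqrt(2 * step) * N(0, 1),
    so the chain approximately targets the density whose log-gradient is given.
    """
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        theta = theta + step * grad_log_post(theta) + math.sqrt(2.0 * step) * noise
    return theta


def make_grad(n, s, prior_var):
    """Gradient of an arm's log-posterior (hypothetical helper).

    Assumes a N(0, prior_var) prior and unit-variance Gaussian rewards, with
    n observed rewards summing to s. Any differentiable log-posterior would do.
    """
    return lambda theta: -theta / prior_var + (s - n * theta)


def batched_langevin_ts(true_means, horizon, seed=0, prior_var=1.0, n_steps=50):
    """Batched Langevin TS sketch with an assumed doubling commitment rule."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k    # pulls per arm
    sums = [0.0] * k    # reward sums per arm
    thetas = [0.0] * k  # Langevin chain states, warm-started across syncs
    pulls = []
    num_syncs = 0
    arm, stop_at = None, 0
    for _ in range(horizon):
        if arm is None or counts[arm] >= stop_at:
            # Synchronization: the only point where new data reach the posterior.
            num_syncs += 1
            for i in range(k):
                precision = 1.0 / prior_var + counts[i]
                step = 0.1 / precision  # step * (gradient Lipschitz constant) = 0.1
                grad = make_grad(counts[i], sums[i], prior_var)
                thetas[i] = langevin_sample(grad, thetas[i], step, n_steps, rng)
            arm = max(range(k), key=lambda i: thetas[i])
            # Commit to `arm` until its pull count doubles (or first reaches 1).
            stop_at = max(1, 2 * counts[arm])
        reward = rng.gauss(true_means[arm], 1.0)  # simulated environment
        counts[arm] += 1
        sums[arm] += reward
        pulls.append(arm)
    return pulls, num_syncs


if __name__ == "__main__":
    pulls, num_syncs = batched_langevin_ts([0.0, 3.0], horizon=1000)
    print("synchronizations:", num_syncs, "pulls of arm 1:", pulls.count(1))
```

Each synchronization commits to an arm and keeps playing it until that arm's pull count goes from c to max(1, 2c). Every completed batch therefore at least doubles the committed arm's count. With K arms and horizon T, the sketch synchronizes at most K(1 + ⌊log₂ T⌋) + 1 times, where the extra 1 covers a final batch cut off by the horizon. This logarithmic synchronization count is the kind of communication budget the abstract refers to.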
Related papers
- BOTS: Batch Bayesian Optimization Of Extended Thompson Sampling For Severely Episode-limited RL Settings (2024)
- Bayesian Bandits: Balancing The Exploration-exploitation Tradeoff Via Double Sampling (2017)
- A Change-detection Based Thompson Sampling Framework For Non-stationary Bandits (2020)
- A Provably Efficient Model-free Posterior Sampling Method For Episodic Reinforcement Learning (2022)
- More Efficient Randomized Exploration For Reinforcement Learning Via Approximate Sampling (2024)
- Deep Bayesian Bandits Showdown: An Empirical Comparison Of Bayesian Deep Networks For Thompson Sampling (2018)
- Thompson Sampling For Infinite-horizon Discounted Decision Processes (2024)
- Analysis Of Thompson Sampling For Controlling Unknown Linear Diffusion Processes (2022)