BOTS: Batch Bayesian Optimization Of Extended Thompson Sampling For Severely Episode-limited RL Settings
2024 · Karine Karine, Susan A. Murphy, Benjamin M. Marlin
Abstract
In settings where the application of reinforcement learning (RL) requires running real-world trials, including the optimization of adaptive health interventions, the number of episodes available for learning can be severely limited due to cost or time constraints. In this setting, the bias-variance trade-off of contextual bandit methods can be significantly better than that of more complex full RL methods. However, Thompson sampling bandits are limited to selecting actions based on distributions of immediate rewards. In this paper, we extend the linear Thompson sampling bandit to select actions based on a state-action utility function consisting of the Thompson sampler's estimate of the expected immediate reward combined with an action bias term. We use batch Bayesian optimization over episodes to learn the action bias terms with the goal of maximizing the expected return of the extended Thompson sampler. The proposed approach is able to learn optimal policies for a strictly broader cl
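The action-selection rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes one Bayesian linear reward model per action with a Gaussian prior and known noise variance, and it assumes the utility is the sampled expected reward plus a fixed per-action bias. The class name `ExtendedLinearTS` and the parameters `action_bias`, `noise_var`, and `prior_var` are hypothetical names chosen for the example. In the approach described above, the bias vector would be proposed by an outer batch Bayesian optimization loop over episodes, which is not shown here.

```python
import numpy as np


class ExtendedLinearTS:
    """Linear Thompson sampling with an additive per-action bias (illustrative sketch)."""

    def __init__(self, n_actions, dim, action_bias, noise_var=1.0, prior_var=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        # Hypothetical: bias terms are held fixed within an episode.
        self.bias = np.asarray(action_bias, dtype=float)
        self.noise_var = noise_var
        # Per-action Gaussian posterior stored as a precision matrix
        # and a precision-weighted mean vector.
        self.precision = np.stack([np.eye(dim) / prior_var for _ in range(n_actions)])
        self.b = np.zeros((n_actions, dim))

    def select(self, state):
        """Sample reward weights per action and pick the highest-utility action."""
        state = np.asarray(state, dtype=float)
        utilities = np.empty(len(self.bias))
        for a in range(len(self.bias)):
            cov = np.linalg.inv(self.precision[a])
            cov = (cov + cov.T) / 2.0  # keep the covariance numerically symmetric
            mean = cov @ self.b[a]
            theta = self.rng.multivariate_normal(mean, cov)
            # Utility = sampled expected immediate reward + action bias.
            utilities[a] = theta @ state + self.bias[a]
        return int(np.argmax(utilities))

    def update(self, state, action, reward):
        """Conjugate Bayesian linear regression update for the chosen action."""
        state = np.asarray(state, dtype=float)
        self.precision[action] += np.outer(state, state) / self.noise_var
        self.b[action] += reward * state / self.noise_var
```

With all bias terms set to zero, this sketch reduces to an ordinary linear Thompson sampling bandit. A nonzero bias shifts the choice toward actions whose value is not captured by the immediate reward alone.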
Related papers
- Langevin Thompson Sampling With Logarithmic Communication: Bandits And Reinforcement Learning (2023)
- Bayesian Bandits: Balancing The Exploration-exploitation Tradeoff Via Double Sampling (2017)
- Deep Bayesian Bandits Showdown: An Empirical Comparison Of Bayesian Deep Networks For Thompson Sampling (2018)
- A Provably Efficient Model-free Posterior Sampling Method For Episodic Reinforcement Learning (2022)
- Policy Gradient Optimization Of Thompson Sampling Policies (2020)
- Beyond Variance Reduction: Understanding The True Impact Of Baselines On Policy Optimization (2020)
- Provably Efficient And Agile Randomized Q-learning (2025)
- Online Bayesian Risk-averse Reinforcement Learning (2025)