Is Offline Decision Making Possible With Only Few Samples? Reliable Decisions In Data-starved Bandits Via Trust Region Enhancement
2024 Β· Ruiqi Zhang, Yuexiang Zhai, Andrea Zanette
Abstract
What can an agent learn in a stochastic Multi-Armed Bandit (MAB) problem from a dataset that contains just a single sample for each arm? Surprisingly, in this work, we demonstrate that even in such a data-starved setting it may still be possible to find a policy competitive with the optimal one. This paves the way to reliable decision-making in settings where critical decisions must be made by relying only on a handful of samples. Our analysis reveals that *stochastic policies can be substantially better* than deterministic ones for offline decision-making. Focusing on offline multi-armed bandits, we design an algorithm called Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST) which is quite different from the predominant value-based lower confidence bound approach. Its design is enabled by localization laws, critical radii, and relative pessimism. We prove that its sample complexity is comparable to that of LCB on minimax problems while being substantially lower o
Authors
(none)
Tags
Stats
Related papers
- A Closer Look At The Worst-case Behavior Of Multi-armed Bandit Algorithms (2021)0.00
- Unlearning Offline Stochastic Multi-armed Bandits (2026)0.00
- Distributionally Robust Model-based Offline Reinforcement Learning With Near-optimal Sample Complexity (2022)0.00
- A Frequency-domain Analysis Of The Multi-armed Bandit Problem: A New Perspective On The Exploration-exploitation Trade-off (2025)0.00
- On-line Learning In Tree Mdps By Treating Policies As Bandit Arms (2026)0.00
- Uncertainty-aware Policy Optimization: A Robust, Adaptive Trust Region Approach (2020)0.00
- Bayesian Bandits: Balancing The Exploration-exploitation Tradeoff Via Double Sampling (2017)0.00
- Online Learning For Cooperative Multi-player Multi-armed Bandits (2021)5.24