Anti-concentrated Confidence Bonuses For Scalable Exploration
2021 · Jordan T. Ash, Cyril Zhang, Surbhi Goel, et al.
Abstract
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off when designing sequential decision-making algorithms, in both foundational theory and state-of-the-art deep reinforcement learning. The LinUCB algorithm, a centerpiece of the stochastic linear bandits literature, prescribes an elliptical bonus which addresses the challenge of leveraging shared information in large action spaces. This bonus scheme cannot be directly transferred to high-dimensional exploration problems, however, due to the computational cost of maintaining the inverse covariance matrix of action features. We introduce *anti-concentrated confidence bounds* for efficiently approximating the elliptical bonus, using an ensemble of regressors trained to predict random noise from policy network-derived features. Using this approximation, we obtain stochastic linear bandit algorithms which obtain \(\tilde O(d \sqrt{T})\) regret bounds for \(\mathrm{poly}(d)\) fixed actions. We develop a
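The link between an elliptical bonus and regressors fit to random noise can be illustrated with a small NumPy sketch. It is a simplified illustration under stated assumptions, not the paper's algorithm: the function names, the Gaussian noise targets, the extra prior-noise term `Z`, and the root-mean-square aggregation are all choices made for this example, and the closed-form solve is used only to keep it short (it forms the very matrix that a scalable method is meant to avoid). With these choices, each ridge regressor's prediction at a query \(x\) has variance \(x^\top (\lambda I + X^\top X)^{-1} x\), so the spread of the ensemble's predictions estimates the squared bonus.

```python
import numpy as np


def elliptical_bonus(X, x, lam=1.0):
    """Exact bonus sqrt(x^T (lam*I + X^T X)^{-1} x) for past features X (n, d)."""
    d = X.shape[1]
    cov = lam * np.eye(d) + X.T @ X
    return float(np.sqrt(x @ np.linalg.solve(cov, x)))


def noise_ensemble(X, n_members=2000, lam=1.0, seed=0):
    """Fit n_members ridge regressors, each to its own random noise targets.

    Assumption of this sketch: besides the noise targets Y, each member gets
    a random prior term Z so that Cov(w) = (lam*I + X^T X)^{-1} exactly.
    Returns the weights as columns of a (d, n_members) array.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    cov = lam * np.eye(d) + X.T @ X
    Y = rng.standard_normal((n, n_members))  # one noise target vector per member
    Z = rng.standard_normal((d, n_members))  # one random prior vector per member
    # Each column solves (X^T X + lam*I) w = X^T y + sqrt(lam) * z.
    return np.linalg.solve(cov, X.T @ Y + np.sqrt(lam) * Z)


def ensemble_bonus(W, x):
    """Root-mean-square of the ensemble's predictions at the query x."""
    return float(np.sqrt(np.mean((x @ W) ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 5))  # 50 previously played action features
    x = rng.standard_normal(5)        # candidate action feature
    W = noise_ensemble(X)
    print("exact bonus:   ", elliptical_bonus(X, x))
    print("ensemble bonus:", ensemble_bonus(W, x))
```

The ensemble estimate tightens as `n_members` grows. In a high-dimensional setting the closed-form solve would presumably be replaced by iterative training of each regressor, which is where any computational saving would come from.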
Related papers
- Context-dependent Upper-confidence Bounds For Directed Exploration (2018) · 0.00
- Tightening Exploration In Upper Confidence Reinforcement Learning (2020) · 0.00
- Unified Framework Of Distributional Regret In Multi-armed Bandits And Reinforcement Learning (2026) · 0.00
- Exploration Via Elliptical Episodic Bonuses (2022) · 3.58
- The Uncertainty Bellman Equation And Exploration (2017) · 0.00
- Information-directed Exploration For Deep Reinforcement Learning (2018) · 0.00
- Exploring Unknown States With Action Balance (2020) · 0.00
- Concave Statistical Utility Maximization Bandits Via Influence-function Gradients (2026) · 0.00