Context-dependent Upper-confidence Bounds For Directed Exploration
2018 · Raksha Kumaraswamy, Matthew Schlegel, Adam White, et al.
Abstract
Directed exploration strategies for reinforcement learning are critical for learning an optimal policy in a minimal number of interactions with the environment. Many algorithms use optimism to direct exploration, either through visitation estimates or upper confidence bounds, as opposed to data-inefficient strategies like ε-greedy that use random, undirected exploration. Most data-efficient exploration methods require significant computation, typically relying on a learned model to guide exploration. Least-squares methods have the potential to provide some of the data-efficiency benefits of model-based approaches (because they summarize past interactions) with computation closer to that of model-free approaches. In this work, we provide a novel, computationally efficient, incremental exploration strategy, leveraging this property of least-squares temporal difference learning (LSTD). We derive upper confidence bounds on the action-values learned by LSTD, with context-depe…
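The abstract's point that least-squares methods summarize past interactions in a way that can be updated incrementally can be illustrated with a minimal sketch. It assumes linear action-value features x(s, a) and an LSTD-Q-style accumulation A = ηI + Σ x(x − γx')ᵀ, b = Σ r·x, with θ = A⁻¹b. Both A⁻¹ and a feature covariance inverse C⁻¹ are kept up to date with Sherman–Morrison rank-one updates, so each step costs O(d²) rather than the O(d³) of a fresh inversion. The class name `IncrementalLSTDQ` and the parameters `eta` and `beta` are hypothetical. The optimism term is a generic elliptical bonus β√(xᵀC⁻¹x), a swapped-in stand-in chosen for illustration; it is not the context-dependent upper confidence bound derived in the paper.

```python
import numpy as np


class IncrementalLSTDQ:
    """Hypothetical sketch: incremental LSTD-Q plus a generic optimism bonus.

    The bonus below is a generic elliptical width on the feature covariance,
    NOT the paper's context-dependent upper confidence bound.
    """

    def __init__(self, d, gamma=0.99, eta=1.0, beta=1.0):
        self.gamma = gamma
        self.beta = beta                  # hypothetical bonus scale
        self.A_inv = np.eye(d) / eta      # inverse of A = eta*I + sum x (x - gamma x')^T
        self.C_inv = np.eye(d) / eta      # inverse of C = eta*I + sum x x^T
        self.b = np.zeros(d)              # b = sum r * x
        self.theta = np.zeros(d)          # LSTD solution theta = A^{-1} b

    @staticmethod
    def _sherman_morrison(M_inv, u, v):
        # Returns (M + u v^T)^{-1} given M^{-1}; assumes 1 + v^T M^{-1} u != 0.
        Mu = M_inv @ u
        vM = v @ M_inv
        return M_inv - np.outer(Mu, vM) / (1.0 + v @ Mu)

    def update(self, phi, reward, phi_next):
        # phi = x(s, a); phi_next = x(s', a') for the next action (zeros if terminal).
        self.A_inv = self._sherman_morrison(self.A_inv, phi, phi - self.gamma * phi_next)
        self.C_inv = self._sherman_morrison(self.C_inv, phi, phi)
        self.b += reward * phi
        self.theta = self.A_inv @ self.b

    def bonus(self, phi):
        # Shrinks as the direction of phi is visited more often.
        return self.beta * np.sqrt(phi @ self.C_inv @ phi)

    def act(self, action_features):
        # Optimistic greedy choice over the feature vectors x(s, a) of each action.
        scores = [phi @ self.theta + self.bonus(phi) for phi in action_features]
        return int(np.argmax(scores))


if __name__ == "__main__":
    e = np.eye(2)
    agent = IncrementalLSTDQ(d=2, gamma=0.9)
    agent.update(e[0], 0.0, np.zeros(2))  # action 0's feature visited once
    print(agent.act([e[0], e[1]]))        # prints 1: the unvisited feature has the larger bonus
```

With equal value estimates, the less-visited action wins purely on its larger bonus, which is the sense in which optimism directs exploration rather than leaving it to random ε-greedy choices.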
Related papers
- Information-directed Exploration For Deep Reinforcement Learning (2018)
- Exploration Conscious Reinforcement Learning Revisited (2018)
- Anti-concentrated Confidence Bonuses For Scalable Exploration (2021)
- Improved Bounds For Reward-agnostic And Reward-free Exploration (2026)
- Dynamic Subgoal-based Exploration Via Bayesian Optimization (2019)
- Directed Exploration In Reinforcement Learning From Linear Temporal Logic (2024)
- Conservative Exploration In Reinforcement Learning (2020)
- On Optimistic Versus Randomized Exploration In Reinforcement Learning (2017)