Gap-dependent Unsupervised Exploration For Reinforcement Learning
2021 Β· Jingfeng Wu, Vladimir Braverman, Lin F. Yang
Abstract
For the problem of task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals, then is revealed with a reward and is asked to compute a corresponding near-optimal policy. Existing approaches mainly concern the worst-case scenarios, in which no structural information of the reward/transition-dynamics is utilized. Therefore the best sample upper bound is \(\propto\widetilde\{\mathcal\{O\}\}(1/\epsilon^2)\), where \(\epsilon>0\) is the target accuracy of the obtained policy, and can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that utilizes a gap parameter, \(\rho>0\), to reduce the amount of exploration. In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only \(\widetilde\{\mathcal\{O\}\} (1/\epsilon \cdot (H^3SA / \rho + H^4 S^2 A) )\) episodes of exploration, and is able to obtain an \(\epsilon\)-optimal policy for a post-reveale
Authors
(none)
Tags
Stats
Related papers
- Minimax-optimal Reward-agnostic Exploration In Reinforcement Learning (2023)0.00
- Provably Efficient Exploration For Reinforcement Learning Using Unsupervised Learning (2020)0.00
- Fast Rates For Maximum Entropy Exploration (2023)0.00
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)0.00
- Improved Bounds For Reward-agnostic And Reward-free Exploration (2026)0.00
- Exploration In Feature Space For Reinforcement Learning (2017)0.00
- Neighboring State-based Exploration For Reinforcement Learning (2022)0.00
- Strategically Efficient Exploration In Competitive Multi-agent Reinforcement Learning (2021)0.00