Improved Bounds For Reward-agnostic And Reward-free Exploration
2026 Β· Oran Ridel, Alon Cohen
Abstract
We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable \(\epsilon\)-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets \(\epsilon\)-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan achieve minimax sample complexity, but only for restrictively small accuracy parameter \(\epsilon\). We propose a new algorithm that significantly relaxes the requirement on \(\epsilon\). Our approach is novel and of technical interest by itself. Our algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate dynamics estimation and subsequent computation of an \(\epsilon\)-optimal policy once the reward is reveal
Authors
(none)
Tags
Stats
Related papers
- Minimax-optimal Reward-agnostic Exploration In Reinforcement Learning (2023)0.00
- Provable Cooperative Multi-agent Exploration For Reward-free Mdps (2026)0.00
- Conservative Exploration In Reinforcement Learning (2020)0.00
- Near-optimal Conservative Exploration In Reinforcement Learning Under Episode-wise Constraints (2023)0.00
- Optimal Horizon-free Reward-free Exploration For Linear Mixture Mdps (2023)0.00
- Provably Efficient Maximum Entropy Exploration (2018)0.00
- Fast Rates For Maximum Entropy Exploration (2023)0.00
- Provably Efficient Reward-agnostic Navigation With Linear Value Iteration (2020)0.00