One Policy Is Enough: Parallel Exploration With A Single Policy Is Near-optimal For Reward-free Reinforcement Learning
2022 · Pedro Cisneros-Velarde, Boxiang Lyu, Sanmi Koyejo, et al.
Abstract
Although parallelism has been extensively used in reinforcement learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL in linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature, which focuses on approaches that encourage agents to explore a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases compared to their fully sequential counterpart. Furthermore, we demonstrate that this simple procedure is near-minimax optimal in the reward-free setting for linear MDPs. From a practical perspective, our paper shows that a single policy is sufficient and provably near-optimal for incorporating parallelism during the exploration phase.
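To make the data-collection pattern in the abstract concrete, here is a minimal toy sketch in which several agents all follow one shared exploration policy and pool their trajectories. It is not the paper's algorithm. The chain environment, the uniform-random policy, and the names `rollout`, `parallel_explore`, and `uniform_policy` are all hypothetical, chosen only for illustration. The sketch keeps the policy fixed and only counts collection rounds, so it says nothing about the sample-complexity guarantees the paper proves.

```python
import random


def rollout(policy, horizon, n_states, rng):
    """Run one episode in a toy chain MDP; return the visited (state, action) pairs."""
    state, trajectory = 0, []
    for _ in range(horizon):
        action = policy(state, rng)
        trajectory.append((state, action))
        # Action 1 moves right, action 0 moves left; both are clipped at the chain ends.
        if action == 1:
            state = min(n_states - 1, state + 1)
        else:
            state = max(0, state - 1)
    return trajectory


def parallel_explore(policy, n_agents, n_trajectories, horizon=5, n_states=4, seed=0):
    """Collect n_trajectories with n_agents that all follow the same policy.

    Agents are simulated one after another here; in a real system each
    rollout within a round would run concurrently.
    """
    rng = random.Random(seed)
    dataset, rounds = [], 0
    while len(dataset) < n_trajectories:
        # One round: every agent runs the single shared policy once.
        batch = [rollout(policy, horizon, n_states, rng) for _ in range(n_agents)]
        dataset.extend(batch)
        rounds += 1
    return dataset[:n_trajectories], rounds


def uniform_policy(state, rng):
    """Hypothetical exploration policy: pick left or right uniformly at random."""
    return rng.randint(0, 1)


data, rounds = parallel_explore(uniform_policy, n_agents=4, n_trajectories=10)
print(len(data), rounds)  # 10 3
```

With four agents, ten trajectories take three rounds instead of the ten a single agent would need. This is only the bookkeeping side of the speedup; whether the pooled data is as useful as sequentially collected data is the question the paper addresses.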
Related papers
- Faster Last-iterate Convergence Of Policy Optimization In Zero-sum Markov Games (2022)
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)
- Provably Efficient Exploration In Policy Optimization (2019)
- Probabilistic Insights For Efficient Exploration Strategies In Reinforcement Learning (2025)
- PC-MLP: Model-based Reinforcement Learning With Policy Cover Guided Exploration (2021)
- Minimax-optimal Reward-agnostic Exploration In Reinforcement Learning (2023)
- Provable Cooperative Multi-agent Exploration For Reward-free MDPs (2026)
- Strategically Efficient Exploration In Competitive Multi-agent Reinforcement Learning (2021)