The Effective Horizon Explains Deep RL Performance In Stochastic Environments
2023 · Cassidy Laidlaw, Banghua Zhu, Stuart Russell, et al.
Abstract
Reinforcement learning (RL) theory has largely focused on proving minimax sample complexity bounds. These require strategic exploration algorithms that use relatively limited function classes for representing the policy or value function. Our goal is to explain why deep RL algorithms often perform well in practice, despite using random exploration and much more expressive function classes like neural networks. Our work arrives at an explanation by showing that many stochastic MDPs can be solved by performing only a few steps of value iteration on the random policy's Q function and then acting greedily. When this is true, we find that it is possible to separate the exploration and learning components of RL, making it much easier to analyze. We introduce a new RL algorithm, SQIRL, that iteratively learns a near-optimal policy by exploring randomly to collect rollouts and then performing a limited number of steps of fitted-Q iteration over those rollouts. Any regression algorithm that sat
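The planning idea in the abstract (take the random policy's Q function, apply a few steps of value iteration, then act greedily) can be illustrated on a small tabular problem. The following is a minimal sketch, not the authors' implementation: it assumes a finite-horizon MDP with a known transition tensor `P` and reward matrix `R`, and the function names (`random_policy_q`, `bellman_backup`, `qvi`, `greedy_policy`) and the randomly generated toy MDP are all hypothetical choices made for illustration.

```python
import numpy as np

def random_policy_q(P, R, T):
    """Q-function of the uniformly random policy, one table per timestep.

    P has shape (S, A, S), R has shape (S, A). Q[T] is zero (episode over).
    """
    S, A = R.shape
    Q = np.zeros((T + 1, S, A))
    for t in range(T - 1, -1, -1):
        v_next = Q[t + 1].mean(axis=1)  # value of acting uniformly at random
        Q[t] = R + P @ v_next
    return Q

def bellman_backup(Q, P, R):
    """One value-iteration step: back up the max of the previous iterate."""
    T = Q.shape[0] - 1
    Q_new = np.zeros_like(Q)
    for t in range(T):
        v_next = Q[t + 1].max(axis=1)  # uses the old iterate at t + 1
        Q_new[t] = R + P @ v_next
    return Q_new

def qvi(P, R, T, k):
    """Start from the random policy's Q and apply k - 1 backups."""
    Q = random_policy_q(P, R, T)
    for _ in range(k - 1):
        Q = bellman_backup(Q, P, R)
    return Q

def greedy_policy(Q):
    """Time-dependent greedy policy, shape (T, S)."""
    return Q[:-1].argmax(axis=2)

# Hypothetical toy MDP: 4 states, 3 actions, horizon 5.
rng = np.random.default_rng(0)
S, A, T = 4, 3, 5
P = rng.dirichlet(np.ones(S), size=(S, A))  # each P[s, a] sums to 1
R = rng.uniform(size=(S, A))

policy = greedy_policy(qvi(P, R, T, k=2))  # act greedily after one backup
```

In this exact, tabular setting, `k = T` recovers the optimal Q function, so the interesting case is when a much smaller `k` already yields a good greedy policy. SQIRL, as described in the abstract, replaces these exact computations with regression over randomly collected rollouts; that sample-based part is not shown here.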
Related papers
- When Simple Exploration Is Sample Efficient: Identifying Sufficient Conditions For Random Exploration To Yield PAC RL Algorithms (2018) 0.00
- Settling The Horizon-dependence Of Sample Complexity In Reinforcement Learning (2021) 3.58
- Model-agnostic Solutions For Deep Reinforcement Learning In Non-ergodic Contexts (2026) 0.00
- Deep Reinforcement Learning At The Edge Of The Statistical Precipice (2021) 0.00
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016) 0.00
- Moments Matter: Stabilizing Policy Optimization Using Return Distributions (2026) 0.00
- SUNRISE: A Simple Unified Framework For Ensemble Learning In Deep Reinforcement Learning (2020) 0.00
- Provably Efficient And Agile Randomized Q-learning (2025) 0.00