Exploration Versus Exploitation In Reinforcement Learning: A Stochastic Control Approach
2018 · Haoran Wang, Thaleia Zariphopoulou, Xunyu Zhou
Abstract
We consider reinforcement learning (RL) in continuous time and study the problem of achieving the best trade-off between exploration of a black box environment and exploitation of current knowledge. We propose an entropy-regularized reward function involving the differential entropy of the distributions of actions, and motivate and devise an exploratory formulation for the feature dynamics that captures repetitive learning under exploration. The resulting optimization problem is a revitalization of the classical relaxed stochastic control. We carry out a complete analysis of the problem in the linear-quadratic (LQ) setting and deduce that the optimal feedback control distribution for balancing exploitation and exploration is Gaussian. This in turn interprets and justifies the widely adopted Gaussian exploration in RL, beyond its simplicity for sampling. Moreover, the exploitation and exploration are captured, respectively and mutual-exclusively, by the mean and variance of the Gaussian ...
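The separation of mean and variance described in the abstract can be illustrated with a small sketch. This is not the paper's continuous-time solution: it is a static, one-step toy under assumed names and values. The quadratic reward `r(x, u) = -(M/2 x^2 + R x u + N/2 u^2)`, the coefficients `M`, `R`, `N`, and the weight `lam` are all hypothetical. The sketch adds `lam` times the differential entropy of a Gaussian action distribution to the expected reward and maximizes over the mean and variance in closed form.

```python
import math


def gaussian_entropy(var):
    """Differential entropy of a one-dimensional Gaussian with variance `var`."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)


def regularized_reward(x, mean, var, lam, M=1.0, R=0.5, N=2.0):
    """Expected toy reward under u ~ N(mean, var), plus lam * entropy.

    Hypothetical static reward: r(x, u) = -(M/2 x^2 + R x u + N/2 u^2).
    Uses E[u] = mean and E[u^2] = mean^2 + var.
    """
    expected_reward = -(0.5 * M * x**2 + R * x * mean + 0.5 * N * (mean**2 + var))
    return expected_reward + lam * gaussian_entropy(var)


def best_gaussian(x, lam, R=0.5, N=2.0):
    """Closed-form maximizer of the toy objective over (mean, var), assuming N > 0."""
    mean = -R * x / N  # first-order condition in the mean: does not involve lam
    var = lam / N      # first-order condition in the variance: does not involve x
    return mean, var


mean, var = best_gaussian(x=1.0, lam=0.4)
value = regularized_reward(1.0, mean, var, lam=0.4)
```

In this toy the mean depends only on the state and reward coefficients (exploitation), while the variance depends only on the entropy weight `lam` (exploration) and shrinks to zero as `lam` does. With `x=1.0` and `lam=0.4` the maximizer has mean -0.25 and variance 0.2.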
Related papers
- Optimal Scheduling Of Entropy Regulariser For Continuous-time Linear-quadratic Reinforcement Learning (2022)
- Actively Learning Reinforcement Learning: A Stochastic Optimal Control Approach (2023)
- A Comparative Theoretical Analysis Of Entropy Control Methods In Reinforcement Learning (2026)
- A Random Measure Approach To Reinforcement Learning In Continuous Time (2024)
- Continuous-time Risk-sensitive Reinforcement Learning Via Quadratic Variation Penalty (2024)
- Sublinear Regret For A Class Of Continuous-time Linear-quadratic Reinforcement Learning Problems (2024)
- Maximum Entropy Exploration Without The Rollouts (2026)
- Exploration Conscious Reinforcement Learning Revisited (2018)