Curious Explorer: A Provable Exploration Strategy In Policy Learning

Abstract

Access to an exploring restart distribution (the so-called wide coverage assumption) is critical for policy gradient methods. The reason is that, although the objective function is insensitive to updates in unlikely states, the agent may still need to improve in those states to reach a nearly optimal payoff. Wide coverage is therefore assumed in some form when analyzing the theoretical properties of practical policy gradient methods. However, this assumption can be infeasible in certain environments, for instance when learning is online or when restarts are possible only from a fixed initial state. In these cases, classical policy gradient algorithms can have very poor convergence properties and sample efficiency. In this paper, we develop Curious Explorer, a novel and simple iterative state space exploration strategy that can be used with any starting distribution \(\rho\). Curious Explorer starts from \(\rho\), then using intrinsic rewards assigned to the s
