\(f\)-Policy Gradients: A General Framework for Goal-Conditioned RL Using \(f\)-Divergences

Abstract

Goal-conditioned reinforcement learning (RL) problems often provide only sparse rewards: the agent receives a reward signal only when it has achieved the goal, which makes policy optimization difficult. Several works augment this sparse reward with a learned dense reward function, but this can lead to sub-optimal policies if the learned reward is misaligned with the task. Moreover, recent works have demonstrated that which shaping rewards are effective for a particular problem can depend on the underlying learning algorithm. This paper introduces a novel way to encourage exploration called \(f\)-Policy Gradients, or \(f\)-PG. \(f\)-PG minimizes the \(f\)-divergence between the agent's state visitation distribution and the goal, which we show can lead to an optimal policy. We derive gradients for various \(f\)-divergences to optimize this objective. Our learning paradigm provides dense learning signals for exploration in sparse reward settings. We further introduce an entropy-regularized policy optimization object
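To make the quantity being minimized concrete, the following minimal sketch evaluates an \(f\)-divergence between a discrete state visitation distribution and a goal distribution. It is an illustration under stated assumptions, not the paper's implementation: the four-state space, the smoothed goal distribution `goal`, the argument order \(D_f(p \,\|\, q) = \sum_s q(s)\, f(p(s)/q(s))\), and the helper names `f_divergence`, `f_kl`, and `f_chi2` are all hypothetical choices made here.

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) = sum_s q(s) * f(p(s) / q(s)) for discrete p, q with q(s) > 0."""
    return sum(qs * f(ps / qs) for ps, qs in zip(p, q))

def f_kl(t):
    # Generator f(t) = t log t, which yields KL(p || q); uses the convention 0 log 0 = 0.
    return t * math.log(t) if t > 0 else 0.0

def f_chi2(t):
    # Generator f(t) = (t - 1)^2, which yields the Pearson chi-squared divergence.
    return (t - 1.0) ** 2

# Assumed toy setup: four states, with the goal concentrated on the last state.
# The goal distribution is smoothed so that every state has nonzero mass.
goal = [0.05, 0.05, 0.05, 0.85]
far = [0.70, 0.20, 0.05, 0.05]   # visitation of a policy that rarely reaches the goal
near = [0.10, 0.10, 0.10, 0.70]  # visitation of a policy that mostly reaches the goal

for name, f in [("KL", f_kl), ("chi2", f_chi2)]:
    print(name, f_divergence(far, goal, f), f_divergence(near, goal, f))
```

In this toy setup the divergence is strictly larger for `far` than for `near` under both generators, even though a sparse goal reward would be near zero for both policies on most steps. This is the sense in which a divergence-based objective can supply a dense learning signal.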
