F-divergence Constrained Policy Improvement

Abstract

To ensure stability of learning, state-of-the-art generalized policy iteration algorithms augment the policy improvement step with a trust region constraint bounding the information loss. The size of the trust region is commonly determined by the Kullback-Leibler (KL) divergence, which not only captures the notion of distance well but also yields closed-form solutions. In this paper, we consider a more general class of $f$-divergences and derive the corresponding policy update rules. The generic solution is expressed through the derivative of the convex conjugate of $f$ and includes the KL solution as a special case. Within the class of $f$-divergences, we further focus on a one-parameter family of $α$-divergences to study the effects of the choice of divergence on policy improvement. Previously known as well as new policy updates emerge for different values of $α$. We show that every type of policy update comes with a compatible policy evaluation resulting from the chosen
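To make the KL special case mentioned above concrete, the following is a minimal sketch and not the paper's implementation. It assumes a single state with a discrete action set, a known advantage vector `adv`, and a current policy `q`. Under those assumptions, the KL-constrained improvement step has the familiar closed form in which the new policy is proportional to `q(a) * exp(adv(a) / eta)`, where the temperature `eta` is chosen so that the KL divergence from the old policy matches the trust region size `eps`. The function names `kl_policy_update`, `kl`, and `solve_eta` are hypothetical, and the bisection over `eta` is a simple stand-in for solving the dual problem.

```python
import numpy as np


def kl_policy_update(q, adv, eta):
    """Closed-form KL update: pi(a) proportional to q(a) * exp(adv(a) / eta)."""
    logits = np.log(q) + (adv - adv.max()) / eta  # shift advantages for stability
    w = np.exp(logits - logits.max())
    return w / w.sum()


def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero-probability entries of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def solve_eta(q, adv, eps, lo=1e-3, hi=1e3, iters=100):
    """Geometric bisection on eta so that KL(pi || q) is at most eps (assumed bracket)."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if kl(kl_policy_update(q, adv, mid), q) > eps:
            lo = mid  # update too aggressive: raise the temperature
        else:
            hi = mid  # constraint satisfied: try a lower temperature
    return hi


# Illustrative inputs (assumed values, not taken from the paper).
q = np.full(4, 0.25)                    # current policy: uniform over 4 actions
adv = np.array([1.0, 0.0, -1.0, 0.5])   # hypothetical advantage estimates
eps = 0.1                               # trust region size

eta = solve_eta(q, adv, eps)
pi = kl_policy_update(q, adv, eta)
print("eta:", eta)
print("new policy:", pi)
print("KL(pi || q):", kl(pi, q))
```

A smaller `eps` drives `eta` up and keeps the new policy closer to `q`; a larger `eps` lets the update concentrate more mass on high-advantage actions.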
