GeMPO: Generalized Measure Matching for Online Diffusion Reinforcement Learning

Abstract

arXiv:2603.10250v2

A commonly used family of RL algorithms for diffusion policies applies softmax reweighting to samples from the behavior policy, which often induces an overly greedy policy and fails to exploit feedback from negative samples. In this work, we introduce GeMPO, a simple and unified framework that generalizes the reweighting scheme in diffusion RL from softmax to general monotonic functions. GeMPO revisits diffusion RL from a measure matching perspective: first, we construct a virtual target policy measure by solving a regularized policy optimization objective; second, we minimize the divergence between the current policy and this target measure through reweighted flow matching. This formulation offers two key advantages: (i) it extends weight design beyond traditional exponential reweighting, allowing it to be tailored to diverse reward landscapes; and (ii) by relaxing the non-negativity constraint on the target measure, it provides a principled justification for negative reweighting. We provide interpretations of how negative reweighting actively repels the policy from suboptimal actions and thus facilitates exploration. Extensive empirical evaluations demonstrate that GeMPO achieves competitive or superior performance by leveraging these flexible weighting schemes, and we provide practical guidelines for selecting reweighting methods.
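To make the contrast between softmax reweighting and a general monotonic reweighting concrete, the following is a minimal sketch in Python with NumPy. It is not the paper's implementation: the function names (`softmax_weights`, `signed_linear_weights`, `reweighted_fm_loss`), the particular signed weight function, and the use of a plain squared-error flow matching loss are all assumptions chosen for illustration. It only shows how a monotonic weight that is allowed to go negative assigns a repelling weight to below-average samples, whereas softmax weights are always positive.

```python
import numpy as np


def softmax_weights(advantages, temperature=1.0):
    """Exponential (softmax) weights: always positive, sum to 1."""
    z = (advantages - advantages.max()) / temperature  # shift for stability
    w = np.exp(z)
    return w / w.sum()


def signed_linear_weights(advantages):
    """Hypothetical monotonic weight that may be negative.

    Centers the advantages and rescales them, so below-average samples
    receive negative weight. This specific form is an assumption made
    for illustration only.
    """
    centered = advantages - advantages.mean()
    scale = np.abs(centered).mean() + 1e-8
    return centered / (scale * len(advantages))


def reweighted_fm_loss(v_pred, v_target, weights):
    """Weighted sum of per-sample squared velocity errors."""
    per_sample = ((v_pred - v_target) ** 2).sum(axis=-1)
    return float((weights * per_sample).sum())


# Toy batch: three sampled actions with increasing advantage.
advantages = np.array([-1.0, 0.0, 2.0])
w_soft = softmax_weights(advantages)
w_signed = signed_linear_weights(advantages)

# Placeholder velocity predictions and targets (2-D actions).
v_pred = np.zeros((3, 2))
v_target = np.ones((3, 2))
loss_soft = reweighted_fm_loss(v_pred, v_target, w_soft)
loss_signed = reweighted_fm_loss(v_pred, v_target, w_signed)
```

In this sketch both weight functions are monotonically increasing in the advantage, but only the signed variant gives the worst sample a negative weight, so minimizing the weighted loss would increase that sample's flow matching error rather than decrease it.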
