Learning A Diffusion Model Policy From Rewards Via Q-score Matching
2023 · Michael Psenka, Alejandro Escontrela, Pieter Abbeel, et al.
Abstract
Diffusion models have become a popular choice for representing actor policies in behavior cloning and offline reinforcement learning, owing to their natural ability to optimize an expressive class of distributions over a continuous space. However, previous works fail to exploit the score-based structure of diffusion models and instead use a simple behavior cloning term to train the actor, which limits their ability in the actor-critic setting. In this paper, we present a theoretical framework that connects diffusion model policies to a learned Q-function by relating the score of the policy to the action gradient of the Q-function. We focus on off-policy reinforcement learning and propose a new policy update method from this theory, which we denote Q-score matching. Notably, this algorithm only needs to differentiate through the denoising model rather than the entire diffusion model evaluation, and converged policies through Q-score matching are i
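The link described in the abstract, between the policy's score and the action gradient of the Q-function, can be illustrated with a small sketch. This is not the authors' implementation: the quadratic critic, the linear score model, the scaling factor `alpha`, and all function names below are assumptions chosen so the example stays self-contained. It fits a score model by regressing it onto `alpha * grad_a Q(s, a)`, and the loss touches only the score model, never a full sampling chain.

```python
import numpy as np

def q_action_grad(state, action):
    """Action gradient of a toy critic (an assumption): Q(s, a) = -||a - s||^2,
    so dQ/da = -2 (a - s)."""
    return -2.0 * (action - state)

def score_model(params, state, action):
    """Hypothetical linear score model: psi(s, a) = W_s s + W_a a."""
    w_s, w_a = params
    return state @ w_s.T + action @ w_a.T

def qsm_loss(params, state, action, alpha=1.0):
    """Mean squared gap between the score and alpha * grad_a Q."""
    diff = score_model(params, state, action) - alpha * q_action_grad(state, action)
    return np.mean(np.sum(diff ** 2, axis=-1))

def qsm_grad(params, state, action, alpha=1.0):
    """Analytic gradient of qsm_loss with respect to (W_s, W_a)."""
    diff = score_model(params, state, action) - alpha * q_action_grad(state, action)
    n = state.shape[0]
    return 2.0 * diff.T @ state / n, 2.0 * diff.T @ action / n

def train(seed=0, steps=500, lr=0.05, dim=2, batch=256):
    """Plain gradient descent on the score-matching loss over a fixed batch."""
    rng = np.random.default_rng(seed)
    state = rng.normal(size=(batch, dim))    # stand-in for replay-buffer states
    action = rng.normal(size=(batch, dim))   # stand-in for replay-buffer actions
    params = (np.zeros((dim, dim)), np.zeros((dim, dim)))
    for _ in range(steps):
        g_s, g_a = qsm_grad(params, state, action)
        params = (params[0] - lr * g_s, params[1] - lr * g_a)
    return params, qsm_loss(params, state, action)

params, final_loss = train()
print("final loss:", final_loss)
```

For this toy critic the target gradient is `2s - 2a`, so the fitted weights should approach `W_s = 2I` and `W_a = -2I`. In an actual actor-critic setup the critic would be learned and its action gradient obtained by automatic differentiation rather than in closed form.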
Related papers
- Reward-directed Score-based Diffusion Models Via Q-learning (2024)
- Diffusion Policies As An Expressive Policy Class For Offline Reinforcement Learning (2022)
- Diffusion Actor-critic: Formulating Constrained Policy Iteration As Diffusion Noise Regression For Offline Reinforcement Learning (2024)
- IDQL: Implicit Q-learning As An Actor-critic Method With Diffusion Policies (2023)
- Contractive Diffusion Policies: Robust Action Diffusion Via Contractive Score-based Sampling With Differential Equations (2026)
- Boosting Continuous Control With Consistency Policy (2023)
- Distributional Soft Actor-critic With Diffusion Policy (2025)
- Diffusion Policies Creating A Trust Region For Offline Reinforcement Learning (2024)