Using Human Feedback To Fine-tune Diffusion Models Without Any Reward Model
2023 · Kai Yang, Jian Tao, Jiafei Lyu, et al.
Abstract
Using reinforcement learning with human feedback (RLHF) has shown significant promise in fine-tuning diffusion models. Previous methods start by training a reward model that aligns with human preferences, then leverage RL techniques to fine-tune the underlying models. However, crafting an efficient reward model demands extensive datasets, optimal architecture, and manual hyperparameter tuning, making the process both time and cost-intensive. The direct preference optimization (DPO) method, effective in fine-tuning large language models, eliminates the necessity for a reward model. However, the extensive GPU memory requirement of the diffusion model's denoising process hinders the direct application of the DPO method. To address this issue, we introduce the Direct Preference for Denoising Diffusion Policy Optimization (D3PO) method to directly fine-tune diffusion models. The theoretical analysis demonstrates that although D3PO omits training a reward model, it effectively functions as t
Related papers
- Avoiding Mode Collapse In Diffusion Models Fine-tuned With Reinforcement Learning (2024)
- Fine-tuning Diffusion Policies With Backpropagation Through Diffusion Timesteps (2025)
- Diwa: Diffusion Policy Adaptation With World Models (2025)
- Learning To Sample From Diffusion Models Via Inverse Reinforcement Learning (2026)
- Understanding Sampler Stochasticity In Training Diffusion Models For RLHF (2025)
- Dichotomous Diffusion Policy Optimization (2025)
- Diffusion Policy Through Conditional Proximal Policy Optimization (2026)
- Zeroth-order Policy Gradient For Reinforcement Learning From Human Feedback Without Reward Inference (2024)