Understanding Sampler Stochasticity In Training Diffusion Models For RLHF
2025 · Jiayuan Sheng, Hanyang Zhao, Haoxian Chen, et al.
Abstract
Reinforcement Learning from Human Feedback (RLHF) is increasingly used to fine-tune diffusion models, but a key challenge arises from the mismatch between stochastic samplers used during training and deterministic samplers used during inference. In practice, models are fine-tuned using stochastic SDE samplers to encourage exploration, while inference typically relies on deterministic ODE samplers for efficiency and stability. This discrepancy induces a reward gap, raising concerns about whether high-quality outputs can be expected during inference. In this paper, we theoretically characterize this reward gap and provide non-vacuous bounds for general diffusion models, along with sharper convergence rates for Variance Exploding (VE) and Variance Preserving (VP) Gaussian models. Methodologically, we adopt the generalized denoising diffusion implicit models (gDDIM) framework to support arbitrarily high levels of stochasticity, preserving data marginals throughout. Empirically, our finding
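To make the sampler mismatch concrete, the following is a minimal sketch, not the paper's gDDIM construction. It uses the standard DDIM update with a stochasticity knob `eta` (`eta = 0` is the deterministic ODE-like sampler, `eta = 1` is DDPM-like) on a one-dimensional Gaussian data distribution under a VP forward process, where the optimal denoiser is available in closed form. Because every step is then affine-Gaussian, the mean and variance of the samples can be propagated exactly instead of simulated. The function names, the linear beta schedule, and the data parameters are all assumptions chosen for illustration.

```python
import math


def alpha_bars(num_steps, beta_min=1e-4, beta_max=0.02):
    """Return [abar_0 = 1, abar_1, ..., abar_T] for a linear beta schedule (assumed)."""
    abars = [1.0]
    for i in range(num_steps):
        beta = beta_min + (beta_max - beta_min) * i / max(num_steps - 1, 1)
        abars.append(abars[-1] * (1.0 - beta))
    return abars


def ddim_final_moments(eta, num_steps=1000, data_mean=2.0, data_std=0.5):
    """Mean and variance of x_0 produced by an eta-DDIM sampler (eta in [0, 1]).

    Data is N(data_mean, data_std**2), so the exact denoiser is linear in x_t
    and the sampler's output distribution stays Gaussian at every step.
    """
    abars = alpha_bars(num_steps)
    s2 = data_std ** 2

    # Start from the exact forward marginal at the final step T.
    a_T = abars[-1]
    mean = math.sqrt(a_T) * data_mean
    var = a_T * s2 + (1.0 - a_T)

    for t in range(num_steps, 0, -1):
        a_t, a_prev = abars[t], abars[t - 1]
        v_t = a_t * s2 + (1.0 - a_t)  # true marginal variance at step t

        # DDIM noise level: sigma_t^2 = eta^2 * (1-abar_{t-1})/(1-abar_t) * (1 - abar_t/abar_{t-1})
        sigma2 = eta ** 2 * (1.0 - a_prev) / (1.0 - a_t) * (1.0 - a_t / a_prev)

        # Closed-form posterior means, with y = x_t - sqrt(abar_t) * data_mean:
        #   eps_pred = k_eps * y,   x0_pred = data_mean + k_x0 * y
        k_eps = math.sqrt(1.0 - a_t) / v_t
        k_x0 = math.sqrt(a_t) * s2 / v_t

        # x_{t-1} = sqrt(abar_{t-1}) * x0_pred
        #           + sqrt(1 - abar_{t-1} - sigma_t^2) * eps_pred + sigma_t * z
        c = math.sqrt(a_prev) * k_x0 + math.sqrt(max(1.0 - a_prev - sigma2, 0.0)) * k_eps

        mean = math.sqrt(a_prev) * data_mean + c * (mean - math.sqrt(a_t) * data_mean)
        var = c * c * var + sigma2

    return mean, var


if __name__ == "__main__":
    for eta in (0.0, 0.5, 1.0):
        m, v = ddim_final_moments(eta)
        print(f"eta={eta:.1f}  mean={m:.4f}  var={v:.4f}  (target var={0.5 ** 2:.4f})")
```

In this toy setting, with an exact denoiser and many steps, samplers with different `eta` reach nearly the same output distribution, which is the sense in which the stochasticity level leaves the data marginals intact. The reward gap discussed in the abstract concerns what happens once the model is fine-tuned under one sampler and evaluated under another, which this sketch does not model.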