DiffusionNFT: Online Diffusion Reinforcement with Forward Process
2025 · Kaiwen Zheng, Huayu Chen, Haotian Ye, et al.
Abstract
Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward-reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization.
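The idea of folding a reward signal into a forward-process flow-matching loss can be illustrated with a small sketch. The abstract does not give the objective, so everything here is an assumption made for illustration: the linear interpolation `x_t = (1 - t) * x0 + t * eps`, the reward weight `r` in [0, 1], the strength `beta`, and the way the "implicit" positive and negative velocities are formed from the current prediction `v_theta` and a frozen reference `v_old`. It should be read as one plausible instance of contrasting positive and negative generations, not as the paper's exact loss.

```python
# Illustrative sketch only: names, schedule, and loss form are assumptions,
# not taken from the abstract above.

def forward_noise(x0, eps, t):
    """Forward process (assumed linear interpolation): x_t = (1 - t) * x0 + t * eps."""
    return [(1 - t) * a + t * e for a, e in zip(x0, eps)]

def target_velocity(x0, eps):
    """Flow-matching regression target for the interpolation above: eps - x0."""
    return [e - a for a, e in zip(x0, eps)]

def sq_err(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def contrastive_fm_loss(v_theta, v_old, v_target, r, beta=1.0):
    """Reward-weighted flow-matching loss on a single clean sample.

    r    : assumed scalar in [0, 1]; 1 means "positive", 0 means "negative".
    beta : assumed strength of the implicit update direction.
    The implicit positive velocity moves from v_old toward v_theta,
    the implicit negative velocity moves the opposite way.
    """
    v_pos = [o + beta * (n - o) for n, o in zip(v_theta, v_old)]
    v_neg = [o - beta * (n - o) for n, o in zip(v_theta, v_old)]
    return r * sq_err(v_pos, v_target) + (1 - r) * sq_err(v_neg, v_target)

# Usage: only the clean sample x0, fresh noise eps, and a time t are needed;
# no reverse sampling trajectory or likelihood enters the loss.
x0, eps, t = [1.0, -1.0], [0.0, 0.5], 0.25
x_t = forward_noise(x0, eps, t)        # input the velocity networks would see
v_target = target_velocity(x0, eps)
v_old = [0.0, 0.0]                     # stand-in for the frozen reference prediction
v_theta = [-0.5, 0.5]                  # stand-in for the trainable prediction
loss_positive = contrastive_fm_loss(v_theta, v_old, v_target, r=1.0)
loss_negative = contrastive_fm_loss(v_theta, v_old, v_target, r=0.0)
```

In this sketch a fully positive sample (`r = 1`, `beta = 1`) reduces to ordinary flow matching on `v_theta`, while a fully negative sample penalizes `v_theta` for moving toward that sample's target. Because the loss depends only on `x0`, `eps`, and `t`, the sampler that produced `x0` can be any black-box solver.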
Related papers
- Reverse Flow Matching: A Unified Framework For Online Reinforcement Learning With Diffusion And Flow Policies (2026)
- Diffusion Policy Through Conditional Proximal Policy Optimization (2026)
- Diffusion Policies As An Expressive Policy Class For Offline Reinforcement Learning (2022)
- DiffPO: Training Diffusion LLMs To Reason Fast And Furious Via Reinforcement Learning (2025)
- Diffusion Policies Creating A Trust Region For Offline Reinforcement Learning (2024)
- Learning To Sample From Diffusion Models Via Inverse Reinforcement Learning (2026)
- Policy Representation Via Diffusion Probability Model For Reinforcement Learning (2023)
- Preferred-action-optimized Diffusion Policies For Offline Reinforcement Learning (2024)