Evolving Diffusion And Flow Matching Policies For Online Reinforcement Learning
2025 · Chubin Zhang, Zhenglin Wan, Feng Chen, et al.
Abstract
Diffusion and flow matching policies offer expressive, multimodal action modeling, yet they are frequently unstable in online reinforcement learning (RL) due to intractable likelihoods and gradients propagating through long sampling chains. Conversely, tractable parameterizations such as Gaussians lack the expressiveness needed for complex control -- exposing a persistent tension between optimization stability and representational power. We address this tension with a key structural principle: decoupling optimization from generation. Building on this, we introduce GoRL (Generative Online Reinforcement Learning), an algorithm-agnostic framework that trains expressive policies from scratch by confining policy optimization to a tractable latent space while delegating action synthesis to a conditional generative decoder. Using a two-timescale alternating schedule and anchoring decoder refinement to a fixed prior, GoRL enables stable optimization while continuously expanding expressiveness.
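The decoupling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the linear-Gaussian latent policy, the toy Euler-integrated decoder, the assumption that the latent and action dimensions are equal, and the 10:1 update ratio are all hypothetical choices made only for illustration. The point it tries to show is that the log-probability needed by a policy-gradient method is computed on the latent alone, so the multi-step sampling chain of the decoder never enters the policy update.

```python
import numpy as np


class LatentGaussianPolicy:
    """Hypothetical tractable policy over latents: z ~ N(W s, sigma^2 I)."""

    def __init__(self, state_dim, latent_dim, sigma=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = np.zeros((latent_dim, state_dim))  # parameters the RL algorithm would update
        self.sigma = sigma

    def sample(self, s):
        mean = self.W @ s
        return mean + self.sigma * self.rng.standard_normal(mean.shape)

    def log_prob(self, s, z):
        # Closed-form Gaussian log-density: no sampling chain involved.
        mean = self.W @ s
        d = z.shape[0]
        sq = np.sum((z - mean) ** 2) / self.sigma ** 2
        return -0.5 * sq - d * np.log(self.sigma) - 0.5 * d * np.log(2 * np.pi)


class ToyFlowDecoder:
    """Hypothetical conditional decoder: Euler-integrates a velocity field from z to an action."""

    def __init__(self, state_dim, latent_dim, n_steps=8, seed=1):
        rng = np.random.default_rng(seed)
        # Stand-in for a learned velocity network (assumed action_dim == latent_dim).
        self.A = 0.1 * rng.standard_normal((latent_dim, latent_dim + state_dim))
        self.n_steps = n_steps

    def decode(self, s, z):
        x = z.copy()
        dt = 1.0 / self.n_steps
        for _ in range(self.n_steps):
            x = x + dt * np.tanh(self.A @ np.concatenate([x, s]))
        return x


def act(policy, decoder, s):
    """Sample a latent, score it tractably, then let the decoder synthesize the action."""
    z = policy.sample(s)
    logp = policy.log_prob(s, z)  # used by the policy update; independent of the decoder
    a = decoder.decode(s, z)
    return a, z, logp


def phase(iteration, policy_iters=10, decoder_iters=1):
    """Assumed two-timescale alternation: many policy updates per decoder refinement."""
    cycle = policy_iters + decoder_iters
    return "policy" if iteration % cycle < policy_iters else "decoder"


if __name__ == "__main__":
    policy = LatentGaussianPolicy(state_dim=3, latent_dim=2)
    decoder = ToyFlowDecoder(state_dim=3, latent_dim=2)
    action, latent, logp = act(policy, decoder, np.ones(3))
    print(action.shape, float(logp), phase(10))
```

In a setup like this, any on-policy or off-policy algorithm that needs only `sample` and `log_prob` could be applied to the latent policy unchanged, which is one way to read the abstract's "algorithm-agnostic" claim; how the decoder is actually trained and anchored to the fixed prior is not specified here and is left out of the sketch.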
Related papers
- Reverse Flow Matching: A Unified Framework For Online Reinforcement Learning With Diffusion And Flow Policies (2026)
- Genpo: Generative Diffusion Models Meet On-policy Reinforcement Learning (2025)
- Revisiting Generative Policies: A Simpler Reinforcement Learning Algorithmic Perspective (2024)
- FM-IRL: Flow-matching For Reward Modeling And Policy Regularization In Reinforcement Learning (2025)
- Diffusion Policies As An Expressive Policy Class For Offline Reinforcement Learning (2022)
- Diffusion Policy Through Conditional Proximal Policy Optimization (2026)
- Guided Flow Policy: Learning From High-value Actions In Offline Reinforcement Learning (2025)
- Value-guidance Meanflow For Offline Multi-agent Reinforcement Learning (2026)