Abstract
arXiv:2603.21621v2 Announce Type: replace Abstract: Classical on-policy algorithms such as PPO and mirror descent policy optimization provide stable proximal policy updates through tractable action likelihoods, but are typically instantiated with simple Gaussian policies whose expressiveness can be limited in complex continuous-control tasks. Generative policies based on diffusion and flow models provide more expressive action distributions, but they naturally define distributions over multi-step denoising paths whose terminal action density is often intractable, creating a mismatch with likelihood-based on-policy proximal updates. To address this mismatch, we introduce \textbf{GSB-MDPO} (\emph{Generalized Schr\"odinger Bridge Mirror Descent Policy Optimization}), which formulates on-policy generative policy optimization as a Generalized Schr\"odinger Bridge problem over state-conditioned generation paths and instantiates the resulting path-measure update through mirror descent policy optimization. The key insight is that the GSB path-space KL plays the role of the proximal term in MDPO while upper-bounding the terminal action KL, enabling direct control of the executed action distribution without explicit terminal action likelihood evaluation. Experiments on 14 continuous-control tasks across Playground and Gym-MuJoCo demonstrate the empirical effectiveness of GSB-MDPO and support path-space regularization as a principled proximal update for multi-step generative policies.