
QuantFPFlow: Quantum Amplitude Estimation for Fokker--Planck Policy Optimisation in Continuous Reinforcement Learning

Abstract

We introduce \textbf{QuantFPFlow}, a reinforcement learning framework that integrates quantum amplitude estimation into the Fokker--Planck~(FP) formulation of stochastic policy optimisation. Classical continuous-space RL agents must estimate the FP partition function $Z = \int e^{-V(\mathbf{x})/D}\,d\mathbf{x}$ at cost $\mathcal{O}(1/\varepsilon^{2})$; QuantFPFlow replaces this with a Grover-amplified amplitude estimator achieving $\mathcal{O}(1/\varepsilon)$ -- a provable quadratic speedup. While the full quantum acceleration requires fault-tolerant hardware, the quantum-inspired classical simulation demonstrated here already exhibits the $\mathcal{O}(1/\varepsilon)$ algorithmic structure. The estimated stationary distribution $\rho^{\star}$ drives a theoretically grounded exploration bonus $R_{\mathrm{aug}} = R_{\mathrm{env}} + \alpha\log(1/\rho^{\star}(s))$. This bonus steers the agent toward globally optimal regions of multimodal reward landscapes while simultaneously constraining policy variance through FP diffusion matching. On a continuous-control task specifically designed to expose local-optima failure, QuantFPFlow achieves mean reward $1{,}295.7 \pm 423.2$ versus $1{,}284.0 \pm 474.0$ for Soft Actor-Critic~(SAC), while discovering the global optimum \textbf{10.4\,\% more frequently} (33.9\,\% vs.\ 30.7\,\%). Policy entropy remains near $H(\pi)\approx 6.5$\,nats throughout training, whereas SAC collapses to $1.5$\,nats, confirming that FP diffusion matching actively prevents premature convergence. Dimensionality experiments further show computational scaling of $\mathcal{O}(d^{0.35})$ for QuantFPFlow versus $\mathcal{O}(d^{0.76})$ for classical FP estimation.
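As a rough illustration of the classical baseline the abstract refers to, the sketch below estimates the partition function by plain Monte Carlo and then forms the augmented reward. It is not the paper's implementation and it does not perform quantum amplitude estimation. It assumes the stationary distribution takes the Boltzmann form $\rho^{\star}(s) = e^{-V(s)/D}/Z$ and that the state space is a bounded box; the function names `estimate_partition_function` and `augmented_reward`, and all parameter names, are hypothetical.

```python
import numpy as np


def estimate_partition_function(potential, low, high, diffusion, n_samples, rng):
    """Plain Monte Carlo estimate of Z = integral of exp(-V(x)/D) over a box.

    Assumption: the domain is the axis-aligned box [low, high]; the error
    shrinks like 1/sqrt(n_samples), i.e. O(1/eps^2) samples for error eps.
    """
    low = np.asarray(low, dtype=float).reshape(-1)
    high = np.asarray(high, dtype=float).reshape(-1)
    volume = np.prod(high - low)
    # Uniform samples in the box, shape (n_samples, d).
    x = rng.uniform(low, high, size=(n_samples, low.size))
    return volume * np.mean(np.exp(-potential(x) / diffusion))


def augmented_reward(env_reward, state, potential, diffusion, z_estimate, alpha):
    """R_aug = R_env + alpha * log(1 / rho*(s)).

    Assumption: rho*(s) = exp(-V(s)/D) / Z (Boltzmann stationary density).
    """
    s = np.atleast_2d(np.asarray(state, dtype=float))
    log_rho = -potential(s)[0] / diffusion - np.log(z_estimate)
    return env_reward - alpha * log_rho


def quadratic_potential(x):
    """Hypothetical test potential V(x) = 0.5 * ||x||^2, evaluated per row."""
    return 0.5 * np.sum(x ** 2, axis=1)
```

For the hypothetical quadratic potential in one dimension with $D = 1$, the exact value is $Z = \sqrt{2\pi}$, which gives a simple sanity check on the estimator. States with low estimated density receive a larger bonus, which is the exploration effect described above.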
