Sublinear Regret for a Class of Continuous-Time Linear-Quadratic Reinforcement Learning Problems

Abstract

We study reinforcement learning (RL) for a class of continuous-time linear-quadratic (LQ) control problems for diffusions, where states are scalar-valued and running control rewards are absent but volatilities of the state processes depend on both state and control variables. We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an RL algorithm to learn the optimal policy parameter directly. Our main contributions include the introduction of an exploration schedule and a regret analysis of the proposed algorithm. We provide the convergence rate of the policy parameter to the optimal one, and prove that the algorithm achieves a regret bound of \(O(N^{\frac{3}{4}})\) up to a logarithmic factor, where \(N\) is the number of learning episodes. We conduct a simulation study to validate the theoretical results and demonstrate the effectiveness and reliability of the proposed algorithm. We also perform numerical comparis
