A Joint Imitation-reinforcement Learning Framework For Reduced Baseline Regret

Abstract

In various control task domains, existing controllers provide a baseline level of performance that, though possibly suboptimal, should be maintained. Reinforcement learning (RL) algorithms that rely on extensive exploration of the state and action space can be used to optimize a control policy. However, fully exploratory RL algorithms may decrease performance below the baseline level during training. In this paper, we address the problem of optimizing a control policy online while minimizing regret with respect to the performance of a baseline policy. We present a joint imitation-reinforcement learning framework, denoted JIRL. The learning process in JIRL assumes the availability of a baseline policy and is designed with two objectives in mind: (a) leveraging the baseline's online demonstrations to minimize the regret with respect to the baseline policy during training, and (b) eventually surpassing the baseline performance. JIRL addresses these objectives by initially learning to imita
