Abstract
We consider the problem of learning in a non-stationary reinforcement learning (RL) environment, where the setting can be fully described by a piecewise stationary discrete-time Markov decision process (MDP). We introduce a variant of the Restarted Bayesian Online Change-Point Detection algorithm (R-BOCPD) that operates on input streams originating from the more general multinomial distribution and provides near-optimal theoretical guarantees in terms of false-alarm rate and detection delay. Based on this, we propose an improved version of the UCRL2 algorithm for MDPs with state transition kernel sampled from a multinomial distribution, which we call R-BOCPD-UCRL2. We perform a finite-time performance analysis and show that R-BOCPD-UCRL2 enjoys a favorable regret bound of \(O\left(D O \sqrt\{A T K_T log\left (\frac\{T\}\{\delta\} \right) + \frac\{K_T log \frac\{K_T\}\{\delta\}\}\{\min\limits_\ell \: \mathbf\{KL\}\left( \{\mathbf\{\theta\}^\{(\ell+1)\}\}\mid\mid\{\mathbf\{\theta\}^\{(\e