Abstract

In this paper, we analyze the sample complexities of learning the optimal state-action value function \(Q^*\) and an optimal policy \(\pi^*\) in a finite discounted Markov decision process (MDP) where the agent has recursive entropic risk-preferences with risk-parameter \(\beta\neq 0\) and where a generative model of the MDP is available. We provide and analyze a simple model based approach which we call model-based risk-sensitive \(Q\)-value-iteration (MB-RS-QVI) which leads to \((\epsilon,\delta)\)-PAC-bounds on \(\|Q^*-Q^k\|\), and \(\|V^*-V^\{\pi_k\}\|\) where \(Q_k\) is the output of MB-RS-QVI after k iterations and \(\pi_k\) is the greedy policy with respect to \(Q_k\). Both PAC-bounds have exponential dependence on the effective horizon \(\frac\{1\}\{1-\gamma\}\) and the strength of this dependence grows with the learners risk-sensitivity \(|\beta|\). We also provide two lower bounds which shows that exponential dependence on \(|\beta|\frac\{1\}\{1-\gamma\}\) is unavoidable in b

Authors

(none)

Tags

  • Model-Based RL

Stats

Related papers