Entropic Risk Optimization In Discounted MDPs: Sample Complexity Bounds With A Generative Model
2025 · Oliver Mortensen, Mohammad Sadegh Talebi
Abstract
In this paper, we analyze the sample complexities of learning the optimal state-action value function \(Q^*\) and an optimal policy \(\pi^*\) in a finite discounted Markov decision process (MDP) where the agent has recursive entropic risk preferences with risk parameter \(\beta\neq 0\) and where a generative model of the MDP is available. We provide and analyze a simple model-based approach, which we call model-based risk-sensitive \(Q\)-value iteration (MB-RS-QVI), which leads to \((\epsilon,\delta)\)-PAC bounds on \(\|Q^*-Q_k\|\) and \(\|V^*-V^{\pi_k}\|\), where \(Q_k\) is the output of MB-RS-QVI after \(k\) iterations and \(\pi_k\) is the greedy policy with respect to \(Q_k\). Both PAC bounds have exponential dependence on the effective horizon \(\frac{1}{1-\gamma}\), and the strength of this dependence grows with the learner's risk sensitivity \(|\beta|\). We also provide two lower bounds which show that exponential dependence on \(|\beta|\frac{1}{1-\gamma}\) is unavoidable in b
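To make the model-based approach concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the recursive entropic risk Bellman update takes the common form \(Q(s,a) = r(s,a) + \gamma\,\frac{1}{\beta}\log \mathbb{E}_{s'\sim P(\cdot|s,a)}\big[e^{\beta V(s')}\big]\) with \(V(s)=\max_a Q(s,a)\), that rewards are known and deterministic, and that the generative model is exposed as a hypothetical callable `sample_next_state(s, a, rng)`. All function names and parameters here are illustrative assumptions.

```python
import numpy as np


def entropic_risk(p, v, beta):
    """Return (1/beta) * log(sum_s' p[..., s'] * exp(beta * v[s'])).

    The maximum of beta * v is subtracted before exponentiating so that
    large |beta| does not overflow. Works for p of shape (..., S).
    """
    z = beta * v
    m = z.max()
    return (m + np.log(p @ np.exp(z - m))) / beta


def estimate_model(sample_next_state, n_states, n_actions, n_samples, rng):
    """Build an empirical transition model from generative-model calls.

    sample_next_state(s, a, rng) is a hypothetical sampler returning the
    index of a next state drawn from P(. | s, a).
    """
    p_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                p_hat[s, a, sample_next_state(s, a, rng)] += 1.0
    return p_hat / n_samples


def mb_rs_qvi(p_hat, r, gamma, beta, k):
    """Run k risk-sensitive Q-value-iteration steps on the empirical model.

    Assumed update: Q <- r + gamma * entropic_risk(p_hat, max_a Q, beta).
    p_hat has shape (S, A, S) and r has shape (S, A).
    """
    q = np.zeros_like(r, dtype=float)
    for _ in range(k):
        v = q.max(axis=1)  # greedy state values
        q = r + gamma * entropic_risk(p_hat, v, beta)
    return q


def greedy_policy(q):
    """Return the greedy action in each state with respect to q."""
    return q.argmax(axis=1)
```

Under these assumptions, one would call `estimate_model` once with a chosen per-pair sample count, pass the result to `mb_rs_qvi` to obtain \(Q_k\), and read off \(\pi_k\) with `greedy_policy`. With deterministic transitions the entropic term reduces to the next-state value, so the sketch coincides with ordinary value iteration in that case.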
Related papers
- Model-free Reinforcement Learning: From Clipped Pseudo-regret To Sample Complexity (2020)
- Bayesian Risk-sensitive Policy Optimization For MDPs With General Loss Functions (2025)
- Sample-efficient Reinforcement Learning For Linearly-parameterized MDPs With A Generative Model (2021)
- Breaking The Sample Size Barrier In Model-based Reinforcement Learning With A Generative Model (2020)
- Value-biased Maximum Likelihood Estimation For Model-based Reinforcement Learning In Discounted Linear MDPs (2023)
- Q-learning With UCB Exploration Is Sample Efficient For Infinite-horizon MDP (2019)
- Efficient Learning For Entropy-regularized Markov Decision Processes Via Multilevel Monte Carlo (2025)
- Model-based Epistemic Variance Of Values For Risk-aware Policy Optimization (2023)