Model-free Reinforcement Learning: From Clipped Pseudo-regret To Sample Complexity
2020 Β· Zihan Zhang, Yuan Zhou, Xiangyang Ji
Abstract
In this paper we consider the problem of learning an \(\epsilon\)-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with \(S\) states, \(A\) actions, the discount factor \(\gamma \in (0,1)\), and an approximation threshold \(\epsilon > 0\), we provide a model-free algorithm to learn an \(\epsilon\)-optimal policy with sample complexity \(\tilde\{O\}(\frac\{SA\ln(1/p)\}\{\epsilon^2(1-\gamma)^\{5.5\}\})\) (where the notation \(\tilde\{O\}(\cdot)\) hides poly-logarithmic factors of \(S,A,1/(1-\gamma)\), and \(1/\epsilon\)) and success probability \((1-p)\). For small enough \(\epsilon\), we show an improved algorithm with sample complexity \(\tilde\{O\}(\frac\{SA\ln(1/p)\}\{\epsilon^2(1-\gamma)^\{3\}\})\). While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on \(S\), our second algorithm beats all known sample complexity bounds and matches the information theoretic lower bound up to logarithmic factors.
Authors
(none)
Tags
Stats
Related papers
- Near Sample-optimal Reduction-based Policy Learning For Average Reward MDP (2022)0.00
- Optimal Sample Complexity For Average Reward Markov Decision Processes (2023)0.00
- Breaking The Sample Size Barrier In Model-based Reinforcement Learning With A Generative Model (2020)9.03
- Regret-optimal Model-free Reinforcement Learning For Discounted Mdps With Short Burn-in Time (2023)0.00
- Breaking The Sample Complexity Barrier To Regret-optimal Model-free Reinforcement Learning (2021)0.00
- Entropic Risk Optimization In Discounted Mdps: Sample Complexity Bounds With A Generative Model (2025)0.00
- Adaptive Sampling For Best Policy Identification In Markov Decision Processes (2020)0.00
- Model-free Reinforcement Learning In Infinite-horizon Average-reward Markov Decision Processes (2019)0.00