A Sharp Analysis Of Model-based Reinforcement Learning With Self-play

Abstract

Model-based algorithms -- algorithms that explore the environment through building and utilizing an estimated model -- are widely used in reinforcement learning practice and theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm -- Optimistic Nash Value Iteration (Nash-VI) for two-player zero-sum Markov games that is able to output an \(\epsilon\)-approximate Nash policy in \(\tilde\{\mathcal\{O\}\}(H^3SAB/\epsilon^2)\) episodes of game playing, where \(S\) is the number of states, \(A,B\) are the number of actions for the two players respectively, and \(H\) is the

A Sharp Analysis Of Model-based Reinforcement Learning With Self-play

Abstract

Authors

Tags

Stats

Related papers