Abstract

Model-based algorithms -- algorithms that explore the environment through building and utilizing an estimated model -- are widely used in reinforcement learning practice and theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm -- Optimistic Nash Value Iteration (Nash-VI) for two-player zero-sum Markov games that is able to output an \(\epsilon\)-approximate Nash policy in \(\tilde\{\mathcal\{O\}\}(H^3SAB/\epsilon^2)\) episodes of game playing, where \(S\) is the number of states, \(A,B\) are the number of actions for the two players respectively, and \(H\) is the

Authors

(none)

Tags

  • Model-Based RL
  • Multi-Agent
  • Value-Based

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyliu2020a

Related papers