Provably Efficient Fictitious Play Policy Optimization For Zero-sum Markov Games With Structured Transitions

Abstract

While single-agent policy optimization in a fixed environment has attracted a lot of research attention recently in the reinforcement learning community, much less is known theoretically when there are multiple agents playing in a potentially competitive environment. We take steps forward by proposing and analyzing new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions. We consider two classes of transition structures: factored independent transition and single-controller transition. For both scenarios, we prove tight \(\widetilde{\mathcal{O}}(\sqrt{K})\) regret bounds after \(K\) episodes in a two-agent competitive game scenario. The regret of each agent is measured against a potentially adversarial opponent who can choose a single best policy in hindsight after observing the full policy sequence. Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of
