Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-regret Learning In Markov Games

Abstract

We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents. Our goal is to develop a no-regret online learning algorithm that (i) takes actions based on the local information observed by the agent and (ii) is able to find the best policy in hindsight. For such a problem, the nonstationary state transitions due to the varying opponent pose a significant challenge. In light of a recent hardness result \citep{liu2022learning}, we focus on the setting where the opponent's previous policies are revealed to the agent for decision making. With such an information structure, we propose a new algorithm, \underline{D}ecentralized \underline{O}ptimistic hype\underline{R}policy m\underline{I}rror de\underline{S}cent (DORIS), which achieves \(\sqrt{K}\)-regret in the context of general function approximation, where \(K\) is the number of episodes. Moreover, when all the agents adopt DORIS, we pro
