Online Learning For Uninformed Markov Games: Empirical Nash-value Regret And Non-stationarity Adaptation

Abstract

We study online learning in two-player uninformed Markov games, where the opponent's actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving no-external-regret is impossible without incurring an exponential dependence on the episode length \(H\). They then turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret \(O(K^\{2/3\})\) after \(K\) episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even in the case where the opponent follows a fixed policy and thus \(O(\sqrt\{K\})\) external regret is well-known to be achievable, their result is still the worse rate \(O(K^\{2/3\})\) on a weaker metric. In this work, we fully address both limitations. First, we introduce empirical Nash-value regret, a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new me

Online Learning For Uninformed Markov Games: Empirical Nash-value Regret And Non-stationarity Adaptation

Abstract

Authors

Tags

Stats

Related papers