Abstract

We consider the problem of learning in adversarial Markov decision processes [MDPs] with an oblivious adversary in a full-information setting. The agent interacts with an environment during \(T\) episodes, each of which consists of \(H\) stages, and each episode is evaluated with respect to a reward function that will be revealed only at the end of the episode. We propose an algorithm, called APO-MVP, that achieves a regret bound of order \(\tilde\{\mathcal\{O\}\}(\mathrm\{poly\}(H)\sqrt\{SAT\})\), where \(S\) and \(A\) are sizes of the state and action spaces, respectively. This result improves upon the best-known regret bound by a factor of \(\sqrt\{S\}\), bridging the gap between adversarial and stochastic MDPs, and matching the minimax lower bound \(Ξ©(\sqrt\{H^3SAT\})\) as far as the dependencies in \(S,A,T\) are concerned. The proposed algorithm and analysis completely avoid the typical tool given by occupancy measures; instead, it performs policy optimization based only on dynami

Authors

(none)

Tags

  • Uncategorized

Stats

Related papers