Score Vs. Winrate In Score-based Games: Which Reward For Reinforcement Learning?
2022 · Luca Pasqualini, Gianluca Amato, Marco Fantozzi, et al.
Abstract
In recent years, DeepMind's AlphaZero algorithm has become the state of the art for efficiently tackling perfect-information, two-player, zero-sum games with a win/lose outcome. However, when the win/lose outcome is decided by a final score difference, AlphaZero may play score-suboptimal moves, because all winning final positions are equivalent from the win/lose perspective. This can be an issue, for instance, when the agent is used for teaching or when trying to understand whether a better move exists. Moreover, there is the theoretical quest for the perfect game. A naive approach would be to train an AlphaZero-like agent to predict score differences instead of win/lose outcomes. Since the game of Go is deterministic, this should also produce outcome-optimal play. However, it is a folklore belief that "this does not work". In this paper, we first provide empirical evidence for this belief. We then give a theoretical interpretation of this suboptimality in general perfect-information …
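The two reward choices discussed above can be compared on a toy model. This is a minimal sketch, not the paper's method or code: it assumes a hypothetical setting where each candidate move leads to a known probability distribution over final score differences (our score minus the opponent's), and it compares a win/lose reward (the sign of the score difference) with the raw score difference as reward. All names (`outcome_reward`, `score_reward`, `best_action`, and the example moves) are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical toy model: each move maps to a lottery over final score
# differences, given as (probability, score_difference) pairs.

def outcome_reward(score_diff: int) -> int:
    """Win/lose reward: +1 for a win, -1 for a loss, 0 for a draw."""
    return (score_diff > 0) - (score_diff < 0)

def score_reward(score_diff: int) -> int:
    """Score reward: the raw final score difference."""
    return score_diff

def expected(lottery, reward):
    """Expected reward of a lottery; probabilities must sum to 1."""
    assert sum(p for p, _ in lottery) == 1
    return sum(p * reward(s) for p, s in lottery)

def best_action(actions, reward):
    """Pick the move with the highest expected reward.
    Ties are broken by insertion order (max returns the first maximum)."""
    return max(actions, key=lambda a: expected(actions[a], reward))

# Case 1 (deterministic, already winning): the win/lose reward sees no
# difference between winning by 1 point and winning by 10 points.
deterministic = {
    "win_by_1": [(Fraction(1), 1)],
    "win_by_10": [(Fraction(1), 10)],
}

# Case 2 (random outcomes, currently losing): a certain 2-point loss versus
# a risky move that wins by 1 with probability 1/4, else loses by 6.
stochastic = {
    "safe": [(Fraction(1), -2)],
    "gamble": [(Fraction(1, 4), 1), (Fraction(3, 4), -6)],
}

if __name__ == "__main__":
    for name, actions in [("deterministic", deterministic), ("stochastic", stochastic)]:
        print(name,
              "| win/lose reward picks:", best_action(actions, outcome_reward),
              "| score reward picks:", best_action(actions, score_reward))
```

In the deterministic case, both moves have the same expected win/lose reward, so the choice between a 1-point and a 10-point win is arbitrary (here it falls to dictionary order). In the random case, the score reward prefers the certain 2-point loss (expected score -2 versus -17/4), while the win/lose reward prefers the riskier move (expected reward -1/2 versus -1), since it is the only one with any chance of winning.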
Related papers
- Impartial Games: A Challenge For Reinforcement Learning (2022)
- Can Meta-interpretive Learning Outperform Deep Reinforcement Learning Of Evaluable Game Strategies? (2019)
- Analysis Of Hyper-parameters For Small Games: Iterations Or Epochs In Self-play? (2020)
- Rinascimento: Using Event-value Functions For Playing Splendor (2020)
- Learning To Win, Lose And Cooperate Through Reward Signal Evolution (2021)
- Combining Deep Reinforcement Learning And Search For Imperfect-information Games (2020)
- Regret-guided Search Control For Efficient Learning In Alphazero (2026)
- A Minimaximalist Approach To Reinforcement Learning From Human Feedback (2024)