Learning Policies From Self-play With Policy Gradients And MCTS Value Estimates
2019 · Dennis J. N. J. Soemers, Éric Piette, Matthew Stephenson, et al.
Abstract
In recent years, state-of-the-art game-playing agents often involve policies that are trained in self-playing processes where Monte Carlo tree search (MCTS) algorithms and trained policies iteratively improve each other. The strongest results have been obtained when policies are trained to mimic the search behaviour of MCTS by minimising a cross-entropy loss. Because MCTS, by design, includes an element of exploration, policies trained in this manner are also likely to exhibit a similar extent of exploration. In this paper, we are interested in learning policies for a project with future goals including the extraction of interpretable strategies, rather than state-of-the-art game-playing performance. For these goals, we argue that such an extent of exploration is undesirable, and we propose a novel objective function for training policies that are not exploratory. We derive a policy gradient expression for maximising this objective function, which can be estimated using MCTS value esti
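The cross-entropy training mentioned in the abstract can be illustrated with a small sketch. It assumes the training target is the normalised MCTS visit-count distribution at the root, which is a common choice in this line of work but is not stated in the abstract itself. The visit counts, logits, and function names are hypothetical and chosen only for illustration. The point it shows: because the target assigns probability to moves visited for exploration, a policy that matches it achieves a lower loss than a near-greedy policy, so the trained policy inherits that exploration.

```python
import numpy as np

def visit_distribution(visit_counts):
    """Normalise root visit counts into a probability distribution (assumed target)."""
    counts = np.asarray(visit_counts, dtype=float)
    return counts / counts.sum()

def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and a softmax policy over logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()                      # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    return float(-(target * log_probs).sum())

# Hypothetical visit counts for three moves: MCTS mostly prefers move 0,
# but still spends some visits exploring moves 1 and 2.
visit_counts = [80, 15, 5]
target = visit_distribution(visit_counts)

# A policy that reproduces the visit distribution exactly.
matching_logits = np.log(target)
# A near-greedy policy that puts almost all probability on move 0.
greedy_logits = np.array([10.0, 0.0, 0.0])

loss_matching = cross_entropy(target, matching_logits)
loss_greedy = cross_entropy(target, greedy_logits)

print("loss (matches visit counts):", loss_matching)
print("loss (near-greedy):         ", loss_greedy)
```

Under this loss the near-greedy policy is penalised for ignoring the explored moves, which is the behaviour the abstract argues is undesirable when the goal is a non-exploratory policy.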
Related papers
- Policy Gradient Search: Online Planning And Expert Iteration Without Search Trees (2019)
- Policy Gradient Algorithms With Monte Carlo Tree Learning For Non-Markov Decision Processes (2022)
- Multiple Policy Value Monte Carlo Tree Search (2019)
- SoftTreeMax: Policy Gradient With Tree Search (2022)
- Behind The Myth Of Exploration In Policy Gradients (2024)
- Combining Off And On-policy Training In Model-based Reinforcement Learning (2021)
- Efficient Competitive Self-play Policy Optimization (2020)
- Learning Self-imitating Diverse Policies (2018)