Policy-value Alignment And Robustness In Search-based Multi-agent Learning
2023 · Niko A. Grupen, Michael Hanlon, Alexis Hao, et al.
Abstract
Large-scale AI systems that combine search and learning have reached super-human levels of performance in game-playing, but have also been shown to fail in surprising ways. The brittleness of such models limits their efficacy and trustworthiness in real-world deployments. In this work, we systematically study one such algorithm, AlphaZero, and identify two phenomena related to the nature of exploration. First, we find evidence of policy-value misalignment -- for many states, AlphaZero's policy and value predictions contradict each other, revealing a tension between accurate move-selection and value estimation in AlphaZero's objective. Further, we find inconsistency within AlphaZero's value function, which causes it to generalize poorly, despite its policy playing an optimal strategy. From these insights we derive VISA-VIS: a novel method that improves policy-value alignment and value robustness in AlphaZero. Experimentally, we show that our method reduces policy-value misalignment by u
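One way to make the notion of policy-value misalignment concrete is sketched below. It is an illustrative assumption, not the paper's definition or metric: a state counts as misaligned when the policy's most probable action is not among the actions the value estimates rate highest. The names `is_misaligned` and `misalignment_rate`, the dictionary layout, and the toy numbers are all hypothetical.

```python
from typing import Dict, List, Tuple

Action = str
# One state = (policy probabilities per action, value estimate per action).
# Values are assumed to be from the perspective of the player to move.
StateEval = Tuple[Dict[Action, float], Dict[Action, float]]


def is_misaligned(policy: Dict[Action, float],
                  action_values: Dict[Action, float],
                  tol: float = 1e-9) -> bool:
    """Return True when the policy's top action is not value-maximizing."""
    policy_choice = max(policy, key=policy.get)
    best_value = max(action_values.values())
    # Ties (within tol) count as aligned.
    return action_values[policy_choice] < best_value - tol


def misalignment_rate(states: List[StateEval]) -> float:
    """Fraction of states where policy and value estimates disagree."""
    if not states:
        return 0.0
    return sum(is_misaligned(p, v) for p, v in states) / len(states)


# Toy data (hypothetical): two states with two legal actions each.
states = [
    ({"a": 0.7, "b": 0.3}, {"a": 0.9, "b": 0.1}),   # policy and value agree
    ({"a": 0.6, "b": 0.4}, {"a": -0.2, "b": 0.5}),  # policy prefers a, value prefers b
]
rate = misalignment_rate(states)
print(rate)
```

Treating ties as aligned keeps the measure from penalizing states where several actions are rated equally good; a stricter or softer criterion (for example, comparing full distributions) would be an equally reasonable choice.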
Related papers
- Targeted Search Control In AlphaZero For Effective Policy Improvement (2023) 0.00
- Modeling Strong And Human-like Gameplay With KL-regularized Search (2021) 0.00
- Robust And Diverse Multi-agent Learning Via Rational Policy Gradient (2025) 0.00
- Learning Policies From Self-play With Policy Gradients And MCTS Value Estimates (2019) 0.00
- Discovering Multiagent Learning Algorithms With Large Language Models (2026) 2.05
- SUB-PLAY: Adversarial Policies Against Partially Observed Multi-agent Reinforcement Learning Systems (2024) 0.00
- Vision-based Generic Potential Function For Policy Alignment In Multi-agent Reinforcement Learning (2025) 0.00
- Role Play: Learning Adaptive Role-specific Strategies In Multi-agent Interactions (2024) 0.00