Is Vanilla Policy Gradient Overlooked? Analyzing Deep Reinforcement Learning For Hanabi
2022 · Bram Grooten, Jelle Wemmenhove, Maurice Poot, et al.
Abstract
In pursuit of enhanced multi-agent collaboration, we analyze several on-policy deep reinforcement learning algorithms on the recently published Hanabi benchmark. Our results suggest a perhaps counter-intuitive finding: in a simplified environment of this multi-agent cooperative card game, Vanilla Policy Gradient outperforms Proximal Policy Optimization (PPO) across multiple random seeds. In our analysis of this behavior, we examine Hanabi-specific metrics and hypothesize a reason for PPO's plateau. In addition, we provide proofs for the maximum length of a perfect game (71 turns) and of any game (89 turns). Our code can be found at: https://github.com/bramgrooten/DeepRL-for-Hanabi
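The two turn counts can be reproduced with a short counting sketch. This is not the authors' proof, only an accounting under the standard Hanabi rules, and it assumes a two-player game (hand size 5, so 40 cards remain in the deck after dealing). The function names and the particular schedule of plays and discards are illustrative assumptions. The sketch shows that the figures are reachable by counting turns; it does not show that no longer game exists.

```python
# Counting sketch (not the authors' proof): reproduce the 89- and 71-turn
# figures from the standard Hanabi rules, ASSUMING a two-player game.

DECK_SIZE = 50     # 5 colors x (three 1s, two 2s, two 3s, two 4s, one 5)
PLAYERS = 2        # assumption: two players
HAND_SIZE = 5      # hand size in a two-player game
HINT_TOKENS = 8    # hint tokens available at the start
STACKS = 5         # one firework stack per color
MAX_SCORE = 25     # a perfect game plays all 25 stack cards


def draw_turns():
    """Plays and discards each draw one card, so this many of them empty the deck."""
    return DECK_SIZE - PLAYERS * HAND_SIZE


def longest_any_game():
    draws = draw_turns()
    # Assumed schedule: every draw turn is a discard, each regaining one token.
    # The token from the last discard arrives too late to fund a hint before
    # the deck runs out, so only draws - 1 regained tokens are counted.
    hints = HINT_TOKENS + (draws - 1)
    # After the last card is drawn, each player takes one final turn.
    return hints + draws + PLAYERS


def longest_perfect_game():
    draws = draw_turns()
    # Assumed schedule: the last draw turn and both final-round turns are plays.
    late_plays = 1 + PLAYERS
    early_plays = MAX_SCORE - late_plays
    early_discards = (draws - 1) - early_plays
    # The very last play completes a stack, so it is a 5; the other four 5s
    # can be played early enough for their bonus tokens to be spent on hints.
    early_fives = STACKS - 1
    hints = HINT_TOKENS + early_discards + early_fives
    return hints + draws + PLAYERS


print(longest_any_game(), longest_perfect_game())
```

Under these assumptions, a longest game spends 47 turns on hints, 40 on discards, and 2 in the final round, while a perfect game spends 29 on hints, 40 on plays or discards that draw a card, and 2 in the final round.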