Reevaluating Policy Gradient Methods For Imperfect-information Games
2025 · Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, et al.
Abstract
In the past decade, motivated by the putative failure of naive self-play deep reinforcement learning (DRL) in adversarial imperfect-information games, researchers have developed numerous DRL algorithms based on fictitious play (FP), double oracle (DO), and counterfactual regret minimization (CFR). In light of recent results of the magnetic mirror descent algorithm, we hypothesize that simpler generic policy gradient methods like PPO are competitive with or superior to these FP-, DO-, and CFR-based DRL approaches. To facilitate the resolution of this hypothesis, we implement and release the first broadly accessible exact exploitability computations for five large games. Using these games, we conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games. Over 7000 training runs, we find that FP-, DO-, and CFR-based approaches fail to outperform generic policy gradient methods. Code is available at https://github.com/nathanlct/IIG-RL-Benchmark and ht
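The comparison above rests on exploitability, which measures how much a best-responding opponent can gain against a fixed strategy profile. As a minimal sketch of the quantity itself, assuming a two-player zero-sum matrix game rather than the large extensive-form games the paper benchmarks, the snippet below averages the two players' best-response gains. The function name `exploitability` and the rock-paper-scissors payoff matrix are illustrative choices, not taken from the released code.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def exploitability(
    payoff: Matrix,
    row_strategy: Sequence[float],
    col_strategy: Sequence[float],
) -> float:
    """Average best-response gain in a two-player zero-sum matrix game.

    payoff[i][j] is the row player's payoff when row plays i and column plays j;
    the column player receives the negation (zero-sum assumption).
    """
    n_rows, n_cols = len(payoff), len(payoff[0])

    # Row player's expected payoff for each pure action against col_strategy.
    row_values = [
        sum(payoff[i][j] * col_strategy[j] for j in range(n_cols))
        for i in range(n_rows)
    ]
    # Row player's expected payoff under each pure column action; the column
    # player's best response is the column that minimizes this value.
    col_values = [
        sum(row_strategy[i] * payoff[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]

    # Sum of both players' best-response gains, averaged over the two players.
    return (max(row_values) - min(col_values)) / 2.0


# Rock-paper-scissors from the row player's perspective (illustrative game).
RPS = [
    [0.0, -1.0, 1.0],
    [1.0, 0.0, -1.0],
    [-1.0, 1.0, 0.0],
]

uniform = [1.0 / 3.0] * 3   # the Nash equilibrium of rock-paper-scissors
always_rock = [1.0, 0.0, 0.0]

print(exploitability(RPS, uniform, uniform))
print(exploitability(RPS, always_rock, always_rock))
```

A profile is a Nash equilibrium exactly when this value is zero, so lower is better. In imperfect-information extensive-form games the best response must be computed over information sets, which is what makes exact computation expensive at scale.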
Related papers
- Is Vanilla Policy Gradient Overlooked? Analyzing Deep Reinforcement Learning For Hanabi (2022)
- Optimistic Natural Policy Gradient: A Simple Efficient Policy Optimization Framework For Online RL (2023)
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)
- Fidelity-induced Interpretable Policy Extraction For Reinforcement Learning (2023)
- Actor-critic Policy Optimization In Partially Observable Multiagent Environments (2018)
- Robust And Diverse Multi-agent Learning Via Rational Policy Gradient (2025)
- Implementation Matters In Deep Policy Gradients: A Case Study On PPO And TRPO (2020)
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)