Trust Region Bounds For Decentralized PPO Under Non-stationarity
2022 · Mingfei Sun, Sam Devlin, Jacob Beck, et al.
Abstract
We present trust region bounds for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL), which hold even when the transition dynamics are non-stationary. This new analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL, which both rely on independent ratios, i.e., computing probability ratios separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises as a result of enforcing the trust region constraint over all decentralized policies. We also show that this trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training, providing a theoretical foundation for proximal ratio clipping. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct result of enforcing such
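To make the idea of independent ratios and agent-count-dependent clipping concrete, here is a minimal sketch. It is not the paper's algorithm: the N-th-root clip range, the function names, and the log-probability values are all assumptions chosen for illustration. The only property relied on is arithmetic: if each of N per-agent ratios lies in [(1 - eps)^(1/N), (1 + eps)^(1/N)], their product lies in [1 - eps, 1 + eps].

```python
import math


def independent_ratios(new_logps, old_logps):
    """Per-agent probability ratios pi_new(a_i | o_i) / pi_old(a_i | o_i)."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def per_agent_clip_range(eps, n_agents):
    """Per-agent bounds whose N-fold product stays inside [1 - eps, 1 + eps].

    Assumption for illustration only; the paper's exact bound may differ.
    """
    low = (1.0 - eps) ** (1.0 / n_agents)
    high = (1.0 + eps) ** (1.0 / n_agents)
    return low, high


def clipped_surrogate(ratio, advantage, low, high):
    """PPO-style clipped objective for a single agent and a single sample."""
    clipped = min(max(ratio, low), high)
    return min(ratio * advantage, clipped * advantage)


# Hypothetical log-probabilities for three agents on one sample.
old_logps = [-1.20, -0.70, -2.00]
new_logps = [-1.10, -0.75, -1.80]
advantage = 0.5  # shared advantage estimate in the cooperative setting

ratios = independent_ratios(new_logps, old_logps)
low, high = per_agent_clip_range(eps=0.2, n_agents=len(ratios))
objectives = [clipped_surrogate(r, advantage, low, high) for r in ratios]

# Product of the clipped per-agent ratios: bounded by [1 - eps, 1 + eps].
joint_clipped = math.prod(min(max(r, low), high) for r in ratios)
```

Note that the per-agent range narrows as the number of agents grows, which is one way to read the abstract's remark that the bound depends on the number of agents in training.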
Related papers
- Multi-agent Trust Region Policy Optimization (2020)
- Trust Region Policy Optimisation In Multi-agent Reinforcement Learning (2021)
- Dealing With Non-stationarity In MARL Via Trust-region Decomposition (2021)
- Adaptive Trust Region Policy Optimization: Global Convergence And Faster Rates For Regularized MDPs (2019)
- Order Matters: Agent-by-agent Policy Optimization (2023)
- JointPPO: Diving Deeper Into The Effectiveness Of PPO In Multi-agent Reinforcement Learning (2024)
- Embedding Safety Into RL: A New Take On Trust Region Methods (2024)
- Truly Proximal Policy Optimization (2019)