Trust Region Policy Optimisation In Multi-agent Reinforcement Learning
2021 · Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, et al.
Abstract
Trust region methods rigorously enabled reinforcement learning (RL) agents to learn monotonically improving policies, leading to superior performance on a variety of tasks. Unfortunately, when it comes to multi-agent reinforcement learning (MARL), the property of monotonic improvement may not simply apply; this is because agents, even in cooperative games, could have conflicting directions of policy updates. As a result, achieving a guaranteed improvement on the joint policy where each agent acts individually remains an open challenge. In this paper, we extend the theory of trust region learning to MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme. Based on these, we develop Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms. Unlike many existing MARL algorithms, HATRPO/HAPPO do not need agents to share parameters, nor do they need any
Related papers
- Multi-agent Trust Region Policy Optimization (2020)
- Trust Region Bounds For Decentralized PPO Under Non-stationarity (2022)
- Heterogeneous-agent Reinforcement Learning (2023)
- Hindsight Trust Region Policy Optimization (2019)
- Multi-agent Constrained Policy Optimisation (2021)
- Heterogeneous Multi-agent Reinforcement Learning Via Mirror Descent Policy Optimization (2023)
- Embedding Safety Into RL: A New Take On Trust Region Methods (2024)
- Adaptive Trust Region Policy Optimization: Global Convergence And Faster Rates For Regularized MDPs (2019)