MARSHAL: Incentivizing Multi-agent Reasoning Via Self-play With Strategic LLMs
2025 · Huining Yuan, Zelai Xu, Zheyue Tan, et al.
Abstract
Developing Large Language Models (LLMs) to cooperate and compete effectively within multi-agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARSHAL, an end-to-end RL framework that incentivizes Multi-Agent Reasoning through Self-play witH strAtegic LLMs in both cooperative and competitive games. MARSHAL features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, MARSHAL agents trained from Qwen3-4B develop strong strategic abilities, with up to 28.7%
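The abstract names two mechanisms, a turn-level advantage estimator and an agent-specific advantage normalization, but does not give their formulas. The code below is a minimal, hypothetical sketch of one way to read them. It assumes a REINFORCE-style discounted return computed once per turn over the acting agent's own later turns, with no learned critic. It also assumes a z-score normalization whose mean and standard deviation are computed separately for each agent. The `Turn` record, the helper names, `gamma`, and the toy episodes are illustrative assumptions, and MARSHAL's actual estimator may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Turn:
    agent: str     # hypothetical field: which player acted on this turn
    reward: float  # hypothetical field: reward credited to that player for the turn


def turn_level_returns(turns: list[Turn], gamma: float = 1.0) -> list[float]:
    """Assumed turn-level estimate (not MARSHAL's published formula): one
    discounted return per turn, accumulated over the acting agent's own later turns."""
    returns = [0.0] * len(turns)
    running: dict[str, float] = defaultdict(float)
    # Walk backwards so each turn adds the discounted return of the same agent's next turn.
    for i in range(len(turns) - 1, -1, -1):
        turn = turns[i]
        running[turn.agent] = turn.reward + gamma * running[turn.agent]
        returns[i] = running[turn.agent]
    return returns


def agent_normalized_advantages(
    agents: list[str], returns: list[float], eps: float = 1e-8
) -> list[float]:
    """Z-score each return against the mean/std of the same agent's samples,
    instead of pooling statistics across all agents in the batch."""
    groups: dict[str, list[float]] = defaultdict(list)
    for agent, ret in zip(agents, returns):
        groups[agent].append(ret)
    stats = {agent: (mean(vals), pstdev(vals)) for agent, vals in groups.items()}
    return [
        (ret - stats[agent][0]) / (stats[agent][1] + eps)
        for agent, ret in zip(agents, returns)
    ]


# Two hypothetical self-play episodes of a two-player game: A wins the first, B the second.
episodes = [
    [Turn("A", 0.0), Turn("B", 0.0), Turn("A", 1.0), Turn("B", -1.0)],
    [Turn("A", 0.0), Turn("B", 0.0), Turn("A", -1.0), Turn("B", 1.0)],
]
agents: list[str] = []
returns: list[float] = []
for episode in episodes:
    agents += [t.agent for t in episode]
    returns += turn_level_returns(episode, gamma=0.9)
advantages = agent_normalized_advantages(agents, returns)
```

Under these assumptions, each turn receives a single scalar signal that could be broadcast to every token generated in that turn. Normalizing per agent also keeps a role whose rewards are systematically larger or noisier from dominating the shared update. That is one plausible reading of why the abstract links this normalization to stable multi-agent training.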
Related papers
- YOLO-MARL: You Only LLM Once For Multi-agent Reinforcement Learning (2024)
- Language-driven Coordination And Learning In Multi-agent Simulation Environments (2025)
- Language Agents With Reinforcement Learning For Strategic Play In The Werewolf Game (2023)
- Reinforcing Competitive Multi-agents For Playing 'So Long Sucker' (2024)
- Towards Collaborative Intelligence: Propagating Intentions And Reasoning For Multi-agent Coordination With Large Language Models (2024)
- End-to-end Optimization Of LLM-driven Multi-agent Search Systems Via Heterogeneous-group-based Reinforcement Learning (2025)
- MAGE: Meta-reinforcement Learning For Language Agents Toward Strategic Exploration And Exploitation (2026)
- Incentivize Without Bonus: Provably Efficient Model-based Online Multi-agent RL For Markov Games (2025)