Off-belief Learning
2021 · Hengyuan Hu, Adam Lerer, Brandon Cui, et al.
Abstract
The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and implicitly rely on multi-step reasoning based on fragile assumptions about other agents' actions and thus fail when paired with humans or independently trained agents at test time. To address this, we present off-belief learning (OBL). At each timestep OBL agents follow a policy \(\pi_1\) that is optimized assuming past actions were taken by a given, fixed policy (\(\pi_0\)), but assuming that future actions will be taken by \(\pi_1\). When \(\pi_0\) is uniform random, OBL converges to an optimal policy that does not rely on inferences based on other agents' behavior (an optimal grounded policy). OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next, thereby introducing multi-level cognitive reasoning in a controlled manner. Unlike exis
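The claim that a uniform random \(\pi_0\) yields a policy that "does not rely on inferences based on other agents' behavior" can be illustrated with a small Bayes update. The sketch below is a toy illustration, not the authors' implementation: the hidden states (`red`, `blue`), the actions (`hint_a`, `hint_b`), and the `convention` policy are all hypothetical. It shows that when past actions are assumed to come from a uniform random policy, observing an action leaves the belief over the hidden state unchanged, whereas assuming a convention-following policy shifts the belief.

```python
from fractions import Fraction


def posterior(prior, policy, action):
    """Bayes update of a belief over hidden states after observing `action`,
    assuming the acting agent followed `policy[state][action]`."""
    unnormalized = {s: p * policy[s][action] for s, p in prior.items()}
    total = sum(unnormalized.values())
    return {s: w / total for s, w in unnormalized.items()}


# Hypothetical prior over the partner's hidden card.
prior = {"red": Fraction(1, 2), "blue": Fraction(1, 2)}

# Hypothetical convention: the partner's action mostly encodes the hidden state.
convention = {
    "red": {"hint_a": Fraction(9, 10), "hint_b": Fraction(1, 10)},
    "blue": {"hint_a": Fraction(1, 10), "hint_b": Fraction(9, 10)},
}

# Uniform random pi_0: the action is independent of the hidden state.
uniform = {
    s: {"hint_a": Fraction(1, 2), "hint_b": Fraction(1, 2)} for s in prior
}

belief_convention = posterior(prior, convention, "hint_a")
belief_uniform = posterior(prior, uniform, "hint_a")

print("belief assuming convention:", belief_convention)
print("belief assuming uniform pi_0:", belief_uniform)
```

Under the uniform assumption the posterior equals the prior, so any policy optimized against that belief can use only what the observations themselves reveal, not what the partner's choice of action might imply.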
Related papers
- Agent-state Based Policies In POMDPs: Beyond Belief-state MDPs (2024)
- Probing Dec-POMDP Reasoning In Cooperative MARL (2026)
- Finite-state Controllers For (hidden-model) POMDPs Using Deep Reinforcement Learning (2026)
- How To Explore With Belief: State Entropy Maximization In POMDPs (2024)
- Robust Asymmetric Learning In POMDPs (2020)
- Centralized Model And Exploration Policy For Multi-agent RL (2021)
- Statistical Tractability Of Off-policy Evaluation Of History-dependent Policies In POMDPs (2025)
- Policy Evaluation In Decentralized POMDPs With Belief Sharing (2023)