Strategy Masking: A Method For Guardrails In Value-based Reinforcement Learning Agents
2025 · Jonathan Keane, Sam Keyser, Jeremy Kedziora
Abstract
The use of reward functions to structure AI learning and decision making is core to the current reinforcement learning paradigm; however, without careful design of reward functions, agents can learn to solve problems in ways that may be considered "undesirable" or "unethical." Without a thorough understanding of the incentives a reward function creates, it can be difficult to impose principled yet general control mechanisms over an agent's behavior. In this paper, we study methods for constructing guardrails for AI agents that use reward functions to learn decision making. We introduce a novel approach, which we call strategy masking, to explicitly learn and then suppress undesirable AI agent behavior. We apply our method to study lying in AI agents and show that it can effectively modify agent behavior by suppressing lying post-training without compromising the agent's ability to perform effectively.
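The "learn and then suppress" idea can be pictured with a small value-based sketch. The abstract does not describe the mechanism, so everything below is an assumption made for illustration: that the agent keeps a separate value estimate per reward dimension (here a task dimension and a hypothetical "lying" dimension), and that a mask weights those dimensions when an action is selected. The function name, the mask vector, and the numbers are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def masked_greedy_action(
    q_values: Sequence[Sequence[float]],
    mask: Sequence[float],
) -> int:
    """Pick the action with the highest mask-weighted value.

    q_values[a][k] is the (assumed) value estimate of action a along
    reward dimension k; mask[k] is the weight given to dimension k.
    Setting mask[k] = 0 removes that dimension from the decision.
    """

    def score(per_dimension: Sequence[float]) -> float:
        return sum(m * q for m, q in zip(mask, per_dimension))

    return max(range(len(q_values)), key=lambda a: score(q_values[a]))


# Hypothetical value estimates: columns are [task value, lying value].
Q = [
    [1.0, 0.0],  # action 0: honest, good for the task
    [0.5, 2.0],  # action 1: relies on lying
    [0.8, 0.0],  # action 2: honest, slightly worse for the task
]

unmasked_choice = masked_greedy_action(Q, mask=[1.0, 1.0])
masked_choice = masked_greedy_action(Q, mask=[1.0, 0.0])
```

With both dimensions active, the lying action has the highest combined score (2.5); once the lying dimension is masked out, the choice falls back to the best honest action. The value estimates are untouched, which is what would allow this kind of suppression to be applied after training under the stated assumptions.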
Related papers
- Scalable Agent Alignment Via Reward Modeling: A Research Direction (2018)
- MAGIC-MASK: Multi-agent Guided Inter-agent Collaboration With Mask-based Explainability For Reinforcement Learning (2025)
- Golden Handcuffs Make Safer AI Agents (2026)
- Reward Tampering Problems And Solutions In Reinforcement Learning: A Causal Influence Diagram Perspective (2019)
- Why The Agent Made That Decision: Contrastive Explanation Learning For Reinforcement Learning (2024)
- Learning Human Rewards By Inferring Their Latent Intelligence Levels In Multi-agent Games: A Theory-of-mind Approach With Application To Driving Data (2021)
- Honesty Is The Best Policy: Defining And Mitigating AI Deception (2023)
- On Assessing The Safety Of Reinforcement Learning Algorithms Using Formal Methods (2021)