Golden Handcuffs Make Safer AI Agents
2026 · Aram Ebtekar, Michael K. Cohen
Abstract
Reinforcement learners can attain high reward through novel unintended strategies. We study a Bayesian mitigation for general environments: we expand the agent's subjective reward range to include a large negative value \(-L\), while the true environment's rewards lie in \([0,1]\). After observing consistently high rewards, the Bayesian policy becomes risk-averse to novel schemes that plausibly lead to \(-L\). We design a simple override mechanism that yields control to a safe mentor whenever the predicted value drops below a fixed threshold. We prove two properties of the resulting agent: (i) Capability: using mentor-guided exploration with vanishing frequency, the agent attains sublinear regret against its best mentor. (ii) Safety: no decidable low-complexity predicate is triggered by the optimizing policy before it is triggered by a mentor.
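The override rule described in the abstract can be illustrated with a toy sketch. This is not the paper's construction: the two-hypothesis posterior, the action names, the numeric values, and the threshold below are all assumptions chosen for illustration. The sketch only shows how a small posterior weight on a \(-L\) outcome can make a novel action unattractive, and how control passes to the mentor when the best predicted value falls below a fixed threshold.

```python
# Toy illustration (hypothetical names and numbers, not the paper's algorithm).

def expected_value(posterior, values):
    """Posterior-weighted value of one action across hypotheses."""
    return sum(p * v for p, v in zip(posterior, values))


def choose_controller(posterior, values_by_action, threshold):
    """Pick the action with the highest predicted value.

    Returns ("agent", action) if that value meets the threshold,
    otherwise ("mentor", None) to signal that control is handed over.
    """
    best_action = max(
        values_by_action,
        key=lambda a: expected_value(posterior, values_by_action[a]),
    )
    best_value = expected_value(posterior, values_by_action[best_action])
    if best_value < threshold:
        return ("mentor", None)
    return ("agent", best_action)


L = 100.0                      # assumed magnitude of the subjective penalty
posterior = [0.99, 0.01]       # [benign hypothesis, hypothesis where novelty leads to -L]
threshold = 0.5                # assumed fixed override threshold

# Per-action value under each hypothesis, in the same order as `posterior`.
actions = {
    "familiar": [0.8, 0.8],    # well-tested behaviour, same value either way
    "novel": [1.0, -L],        # higher reward if benign, -L otherwise
}

novel_value = expected_value(posterior, actions["novel"])
decision = choose_controller(posterior, actions, threshold)

# If even the familiar action looks poor, the mentor takes over.
poor_actions = {"familiar": [0.3, 0.3], "novel": [1.0, -L]}
fallback = choose_controller(posterior, poor_actions, threshold)
```

With these assumed numbers, a 1% posterior weight on the \(-L\) hypothesis is enough to push the novel action's predicted value below zero, so the agent keeps control but prefers the familiar action. In the second case no action reaches the threshold, so the sketch hands control to the mentor.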