Controlling Underestimation Bias In Constrained Reinforcement Learning For Safe Exploration
2026 · Shiqing Gao, Jiaxin Ding, Luoyi Fu, et al.
Abstract
Constrained Reinforcement Learning (CRL) aims to maximize cumulative rewards while satisfying constraints. However, existing CRL algorithms often encounter significant constraint violations during training, limiting their applicability in safety-critical scenarios. In this paper, we identify the underestimation of the cost value function as a key factor contributing to these violations. To address this issue, we propose the Memory-driven Intrinsic Cost Estimation (MICE) method, which introduces intrinsic costs to mitigate underestimation and control bias to promote safer exploration. Inspired by flashbulb memory, where humans vividly recall dangerous experiences to avoid risks, MICE constructs a memory module that stores previously explored unsafe states to identify high-cost regions. The intrinsic cost is formulated as the pseudo-count of the current state visiting these risk regions. Furthermore, we propose an extrinsic-intrinsic cost value function that incorporates intrinsic costs
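As a rough illustration of the memory module and pseudo-count idea described in the abstract, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how the pseudo-count is computed, so the Gaussian kernel over Euclidean distance, the FIFO eviction rule, and the names `UnsafeMemory`, `pseudo_count`, `total_cost`, `bandwidth`, and `scale` are all assumptions made for this example.

```python
import numpy as np


class UnsafeMemory:
    """Hypothetical store of previously visited unsafe states (assumed design)."""

    def __init__(self, capacity=1000, bandwidth=1.0):
        self.capacity = capacity    # assumed: maximum number of stored states
        self.bandwidth = bandwidth  # assumed: Gaussian kernel width
        self.states = []

    def add(self, state, extrinsic_cost):
        # Only states that incurred a positive extrinsic cost are remembered.
        if extrinsic_cost > 0:
            self.states.append(np.asarray(state, dtype=float))
            if len(self.states) > self.capacity:
                self.states.pop(0)  # assumed: evict the oldest entry first

    def pseudo_count(self, state):
        # Soft count of stored unsafe states near `state`:
        # each stored state contributes a kernel weight in (0, 1].
        if not self.states:
            return 0.0
        memory = np.stack(self.states)
        sq_dist = np.sum((memory - np.asarray(state, dtype=float)) ** 2, axis=1)
        weights = np.exp(-sq_dist / (2.0 * self.bandwidth ** 2))
        return float(np.sum(weights))


def total_cost(memory, state, extrinsic_cost, scale=0.1):
    # Assumed combination: extrinsic cost plus a scaled intrinsic cost.
    return extrinsic_cost + scale * memory.pseudo_count(state)
```

Under these assumptions, a state close to remembered unsafe states receives a larger intrinsic cost than a distant one, which raises the cost signal in regions the agent has already found to be risky.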
Related papers
- Autocost: Evolving Intrinsic Cost For Zero-violation Reinforcement Learning (2023)
- Provably Efficient Exploration In Inverse Constrained Reinforcement Learning (2024)
- Imitate The Good And Avoid The Bad: An Incremental Approach To Safe Reinforcement Learning (2023)
- Learning To Explore When Mistakes Are Not Allowed (2025)
- Safe In-context Reinforcement Learning (2025)
- Towards Safe Reinforcement Learning Via Constraining Conditional Value-at-risk (2022)
- Conservative And Adaptive Penalty For Model-based Safe Reinforcement Learning (2021)
- DOPE: Doubly Optimistic And Pessimistic Exploration For Safe Reinforcement Learning (2021)