Regret-aware Policy Optimization: Environment-level Memory For Replay Suppression Under Delayed Harm
2026 · Prakul Sunil Hiremath
Abstract
Safety in reinforcement learning (RL) is typically enforced through objective shaping while keeping environment dynamics stationary with respect to observable state-action pairs. Under delayed harm, this can lead to replay: after a washout period, reintroducing the same stimulus under matched observable conditions reproduces a similar harmful cascade. We introduce the Replay Suppression Diagnostic (RSD), a controlled exposure-decay-replay protocol that isolates this failure mode under frozen-policy evaluation. We show that, under stationary observable transition kernels, replay cannot be structurally suppressed without inducing a persistent shift in replay-time action distributions. Motivated by platform-mediated systems, we propose Regret-Aware Policy Optimization (RAPO), which augments the environment with persistent harm-trace and scar fields and applies a bounded, mass-preserving transition reweighting to reduce reachability of historically harmful regions. On graph diffusion
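As a rough illustration of what a bounded, mass-preserving transition reweighting might look like, the sketch below damps the probability of entering states with a high scar value and then renormalizes each row so it still sums to one. It is not the paper's formulation. The function name `reweight_transitions`, the exponential damping, and the parameters `beta` and `w_min` are all assumptions made for this example.

```python
import math


def reweight_transitions(P, scar, beta=1.0, w_min=0.5):
    """Hypothetical sketch of a bounded, mass-preserving reweighting.

    P     : row-stochastic transition matrix as a list of rows.
    scar  : non-negative scar value per destination state (assumed form).
    beta  : damping strength (assumed parameter).
    w_min : lower bound on any per-state weight, with 0 < w_min <= 1, so
            no transition is scaled by less than w_min before renormalizing.
    """
    # Per-destination weight in [w_min, 1]; a larger scar gives a smaller weight.
    weights = [max(w_min, math.exp(-beta * s)) for s in scar]
    reweighted = []
    for row in P:
        scaled = [p * w for p, w in zip(row, weights)]
        total = sum(scaled)  # positive because w_min > 0 and the row sums to 1
        # Renormalize so the row remains a probability distribution.
        reweighted.append([x / total for x in scaled])
    return reweighted


# Toy 3-state chain in which state 2 carries a scar from past harm.
P = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
scar = [0.0, 0.0, 2.0]
P_new = reweight_transitions(P, scar)
```

In this toy setup the probability of moving from state 0 into the scarred state 2 falls below its original 0.5, while every row still sums to one. Because each weight lies in [w_min, 1], any single transition probability changes by at most a factor of 1 / w_min.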