Optimistic Policy Learning Under Pessimistic Adversaries With Regret And Violation Guarantees

2026

arXiv:ganguly2026optimistic

Abstract

Real-world decision-making systems operate in environments where state transitions depend not only on the agent's actions but also on \textbf{exogenous factors outside its control}, such as competing agents, environmental disturbances, or strategic adversaries. Formally, \(s_{h+1} = f(s_h, a_h, \bar{a}_h) + \omega_h\), where \(a_h\) is the agent's action, \(\bar{a}_h\) is the adversary/external action, and \(\omega_h\) is additive noise. Ignoring such factors can yield policies that are optimal in isolation but \textbf{fail catastrophically in deployment}, particularly when safety constraints must be satisfied. Standard Constrained MDP formulations assume the agent is the sole driver of state evolution, an assumption that breaks down in safety-critical settings. Existing robust RL approaches address this via distributional robustness over transition kernels, but they do not explicitly model the \textbf{strategic interaction} between the agent and the exogenous factor, and rely on strong assu
