
Discovering Agentic Safety Specifications From 1-bit Danger Signals

2026

Abstract

Can large language model agents discover hidden safety objectives through experience alone? We introduce EPO-Safe (Experiential Prompt Optimization for Safe Agents), a framework where an LLM iteratively generates action plans, receives sparse binary danger warnings, and evolves a natural language behavioral specification through reflection. Unlike standard LLM reflection methods that rely on rich textual feedback (e.g., compiler errors or detailed environment responses), EPO-Safe demonstrates that LLMs can perform safety reasoning from a strictly impoverished signal in structured, low-dimensional environments: the agent never observes the hidden performance function $R^*$, only a single bit per timestep indicating that an action was unsafe. We evaluate on five AI Safety Gridworlds (Leike et al., 2017) and five text-based scenario analogs where visible reward $R$ may diverge from $R^*$. EPO-Safe discovers safe behavior within 1-2 ro
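The plan, warn, reflect loop described in the abstract can be illustrated with a toy sketch. This is not the EPO-Safe implementation: the LLM planner and the LLM reflection step are replaced by trivial deterministic stand-ins, the natural language specification is reduced to a set of forbidden action names, and every function name, action name, and parameter below is a hypothetical chosen for illustration. The only property kept from the abstract is that the loop sees one danger bit per timestep and never reads the hidden objective.

```python
from typing import Callable, List, Set, Tuple

Plan = List[str]


def run_episode(plan: Plan, is_dangerous: Callable[[str], bool]) -> List[int]:
    """Return one danger bit per timestep. The hidden objective is never exposed."""
    return [1 if is_dangerous(action) else 0 for action in plan]


def propose_plan(spec: Set[str], candidate_actions: List[str], horizon: int) -> Plan:
    """Stand-in for the LLM planner: pick the first actions the spec does not forbid."""
    allowed = [a for a in candidate_actions if a not in spec]
    return allowed[:horizon]


def reflect(spec: Set[str], plan: Plan, danger_bits: List[int]) -> Set[str]:
    """Stand-in for LLM reflection: forbid every action that drew a warning."""
    flagged = {a for a, bit in zip(plan, danger_bits) if bit == 1}
    return spec | flagged


def evolve_spec(
    candidate_actions: List[str],
    is_dangerous: Callable[[str], bool],
    horizon: int = 2,
    max_rounds: int = 5,
) -> Tuple[Set[str], int]:
    """Iterate plan -> danger bits -> reflection until a plan draws no warning."""
    spec: Set[str] = set()
    rounds_used = 0
    for rounds_used in range(1, max_rounds + 1):
        plan = propose_plan(spec, candidate_actions, horizon)
        bits = run_episode(plan, is_dangerous)
        if not any(bits):
            break  # the current specification produced a warning-free plan
        spec = reflect(spec, plan, bits)
    return spec, rounds_used


# Hypothetical toy setup: two actions are unsafe, but the loop only sees the bits.
ACTIONS = ["cross_lava", "push_box_into_corner", "walk_around", "wait"]
HIDDEN_UNSAFE = {"cross_lava", "push_box_into_corner"}

final_spec, rounds = evolve_spec(ACTIONS, lambda a: a in HIDDEN_UNSAFE)
```

In this toy setup the specification grows only from the warning bits, which mirrors the impoverished-signal constraint. A real system would replace `propose_plan` and `reflect` with LLM calls and keep the specification as free text.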
