Structural Enforcement Of Goal Integrity In AI Agents Via Separation-of-powers Architecture
2026 · Rong Xiang
Abstract
arXiv:2604.23646v1
Recent evidence suggests that frontier AI systems can exhibit agentic misalignment, generating and executing harmful actions derived from internally constructed goals, even without explicit user requests. Existing mitigation methods, such as Reinforcement Learning from Human Feedback (RLHF) and constitutional prompting, operate primarily at the model level and provide only probabilistic safety guarantees. We propose the Policy-Execution-Authorization (PEA) architecture, a "separation-of-powers" design that enforces safety at the system level. PEA decouples intent generation, authorization, and execution into independent, isolated layers connected via cryptographically constrained capability tokens. We present five core contributions: (C1) an Intent Verification Layer (IVL) for ensuring capability-intent consistency; (C2) Intent Lineage Tracking (ILT), which binds all executable intents to the originating user request via cryptographic an
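The separation described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the names `authorize`, `execute`, `CapabilityToken`, and `SECRET_KEY` are hypothetical, and an HMAC over the action plus a hash of the user request is assumed here as a stand-in for the "cryptographically constrained capability tokens" and the binding of intents to the originating request. The executor refuses any action that lacks a token issued for exactly that action and request.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass
from typing import Optional, Set

# Hypothetical demo key. In this sketch it is shared by the authorization
# layer and the executor; the policy (intent-generating) layer never sees it.
SECRET_KEY = b"demo-key-not-for-production"


@dataclass(frozen=True)
class CapabilityToken:
    action: str          # the single action this token permits
    request_digest: str  # SHA-256 of the originating user request
    signature: str       # HMAC over (action, request_digest)


def digest_request(user_request: str) -> str:
    """Hash the user request so a token can be bound to it."""
    return hashlib.sha256(user_request.encode("utf-8")).hexdigest()


def _sign(action: str, request_digest: str) -> str:
    """Compute the HMAC-SHA256 tag over a canonical JSON payload."""
    payload = json.dumps(
        {"action": action, "request": request_digest}, sort_keys=True
    ).encode("utf-8")
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def authorize(
    action: str, user_request: str, allowed_actions: Set[str]
) -> Optional[CapabilityToken]:
    """Authorization layer: issue a token only for allow-listed actions."""
    if action not in allowed_actions:
        return None
    request_digest = digest_request(user_request)
    return CapabilityToken(action, request_digest, _sign(action, request_digest))


def execute(
    action: str, user_request: str, token: Optional[CapabilityToken]
) -> str:
    """Execution layer: run the action only if the token covers it."""
    if token is None:
        raise PermissionError("no capability token presented")
    expected = _sign(action, digest_request(user_request))
    if token.action != action or not hmac.compare_digest(
        token.signature, expected
    ):
        raise PermissionError("token does not cover this action and request")
    return f"executed {action}"
```

A token issued for `read_file` under one request cannot be replayed for `delete_file` or under a different request, because the recomputed tag will not match. A shared HMAC key is used only to keep the sketch short; since the executor holds the same key it could mint tokens itself, so a stricter design would presumably use asymmetric signatures so that only the authorization layer can sign.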
Related papers
- A Regulation Enforcement Solution For Multi-agent Reinforcement Learning (2019) 2.26
- Formal Ethical Obligations In Reinforcement Learning Agents: Verification And Policy Updates (2024) 0.00
- Discovering Agentic Safety Specifications From 1-bit Danger Signals (2026) 0.00
- Strategy Masking: A Method For Guardrails In Value-based Reinforcement Learning Agents (2025) 0.00
- Policy-value Alignment And Robustness In Search-based Multi-agent Learning (2023) 0.00
- An Abstraction-based Method To Check Multi-agent Deep Reinforcement-learning Behaviors (2021) 2.26
- Towards Measuring Goal-directedness In AI Systems (2024) 0.00
- Think Smart, Act SMARL! Analyzing Probabilistic Logic Shields For Multi-agent Reinforcement Learning (2024) 0.00