Containment Verification: AI Safety Guarantees Independent of Alignment

arXiv:2605.0904566.1

AI Analysis

This work provides a new approach to AI safety that is independent of model alignment, addressing the challenge of verifying safety in agentic systems for AI safety researchers.

The paper introduces containment verification, a method to guarantee AI safety at the agentic framework level rather than the model level, proving that for boundary-enforceable properties, the safety guarantee holds for any possible AI output. They verify PocketFlow, a minimalist agentic LLM framework, using Dafny and an agentic synthesis pipeline.

Agentic frameworks are the software layer through which AI agents act in the world. Existing safety methods intervene on the model and therefore remain conditional on unverifiable properties of learned behavior. We introduce containment verification, which locates safety guarantees in the agentic framework itself. Under havoc oracle semantics, the AI is modeled as an unconstrained oracle ranging over the entire typed action space, and the verified containment layer must enforce the boundary policy for every possible AI output. For boundary-enforceable properties, expressed over modeled boundary events, action arguments, and state, we prove a universal guarantee by forward-simulation refinement and mechanize it in Dafny. We instantiate the paradigm by verifying PocketFlow, a minimalist agentic LLM framework, and use an agentic synthesis pipeline to generate the specification, operational model, and refinement proof under an information barrier against tautological specifications. To our knowledge, this is the first deductive formal verification of an agentic framework, and its guarantee is invariant to model capability over the modeled typed action boundary.

View on arXiv PDF

Similar