Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

Xisen Jin, Michael Duan, Qin Lin, Aaron Chan, Zhenglun Chen, Junyi Du, Xiang Ren

arXiv:2603.05786v1 | 10.34 citations | h-index: 15 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses how users can verify an AI agent developer's safety claims, particularly when the developer might falsely advertise safety measures. It also acknowledges a remaining risk: a malicious developer can still actively jailbreak the guardrail.

This paper introduces proof-of-guardrail, a system that lets AI agent developers cryptographically prove that a response was generated after a specific open-source guardrail was executed. The developer runs the agent and the guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution that users can verify.

As AI agents become widely deployed as online services, users often rely on an agent developer's claim about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address the threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response is generated after a specific open-source guardrail. To generate proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution verifiable by any user offline. We implement proof-of-guardrail for OpenClaw agents and evaluate latency overhead and deployment cost. Proof-of-guardrail ensures integrity of guardrail execution while keeping the developer's agent private, but we also highlight a risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: https://github.com/SaharaLabsAI/Verifiable-ClawGuard
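The attestation flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Attestation`, `tee_attest`, `verify_offline`, and `DEMO_TEE_KEY` are hypothetical, and the signature is an HMAC stand-in chosen only so the example runs with the Python standard library. A real TEE would sign with an asymmetric key chained to the hardware vendor, so that a user needs no shared secret. The sketch shows the three checks a user might perform offline: the signature is valid, the measured code matches the published open-source guardrail, and the attestation is bound to the response actually received.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass

# Hypothetical stand-in for the TEE signing key. A real TEE signs with an
# asymmetric key chained to the hardware vendor; HMAC is used here only so
# the sketch runs with the standard library.
DEMO_TEE_KEY = b"demo-tee-key"


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Attestation:
    measurement: str      # hash of the guardrail code that ran in the TEE
    response_digest: str  # hash of the response the attestation is bound to
    signature: str        # stand-in signature over the two fields above


def _payload(measurement: str, response_digest: str) -> bytes:
    """Canonical byte encoding of the signed fields."""
    fields = {"measurement": measurement, "response_digest": response_digest}
    return json.dumps(fields, sort_keys=True).encode()


def _sign(payload: bytes) -> str:
    return hmac.new(DEMO_TEE_KEY, payload, hashlib.sha256).hexdigest()


def tee_attest(guardrail_code: bytes, response: str) -> Attestation:
    """Simulate what the TEE emits after running the guardrail."""
    measurement = sha256_hex(guardrail_code)
    response_digest = sha256_hex(response.encode())
    return Attestation(measurement, response_digest,
                       _sign(_payload(measurement, response_digest)))


def verify_offline(att: Attestation, response: str,
                   published_guardrail_code: bytes) -> bool:
    """User-side check: valid signature, expected code, matching response."""
    expected_sig = _sign(_payload(att.measurement, att.response_digest))
    return (
        hmac.compare_digest(att.signature, expected_sig)
        and att.measurement == sha256_hex(published_guardrail_code)
        and att.response_digest == sha256_hex(response.encode())
    )


if __name__ == "__main__":
    guardrail = b"def guardrail(text): return text"  # placeholder source
    att = tee_attest(guardrail, "Hello")
    print(verify_offline(att, "Hello", guardrail))           # True
    print(verify_offline(att, "Tampered", guardrail))        # False
    print(verify_offline(att, "Hello", b"other guardrail"))  # False
```

A passing check in this sketch only shows that the measured guardrail code ran and that the response is the one it was bound to. As the abstract notes, it does not rule out a developer actively jailbreaking the guardrail.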
