CR · AI · CL · Mar 6

Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

arXiv:2603.05786v1 · 4 citations · h-index: 3 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses how users can verify an AI agent developer's safety claims, particularly when the developer might falsely advertise safety measures. It acknowledges that the approach does not rule out a developer actively jailbreaking the guardrail.

This paper introduces proof-of-guardrail, a system that lets AI agent developers cryptographically prove that a response was generated after a specific open-source guardrail ran. The developer runs the agent and the guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution that users can verify.

As AI agents become widely deployed as online services, users often rely on an agent developer's claim about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address the threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response is generated after a specific open-source guardrail. To generate proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution verifiable by any user offline. We implement proof-of-guardrail for OpenClaw agents and evaluate latency overhead and deployment cost. Proof-of-guardrail ensures integrity of guardrail execution while keeping the developer's agent private, but we also highlight a risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: https://github.com/SaharaLabsAI/Verifiable-ClawGuard
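The verification flow described above can be sketched as follows. This is a minimal illustration and not the paper's implementation: the `Attestation` fields, `toy_tee_sign`, and `verify` are hypothetical names, and an HMAC with a shared key stands in for the TEE's hardware-backed asymmetric signature, which in a real deployment any user could check against the vendor's public certificate chain. The sketch shows the three checks a user would make: the signature is valid, the attested code measurement matches the published open-source guardrail, and the attestation is bound to the response actually received.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Attestation:
    """Hypothetical attestation layout, for illustration only."""
    measurement: str  # hash of the guardrail code the TEE reports it ran
    report_data: str  # hash of the response bound into the attestation
    signature: str    # stand-in signature over the two fields above


def _payload(measurement: str, report_data: str) -> bytes:
    """Canonical byte encoding of the signed fields."""
    fields = {"measurement": measurement, "report_data": report_data}
    return json.dumps(fields, sort_keys=True).encode()


def toy_tee_sign(key: bytes, guardrail_code: bytes, response: bytes) -> Attestation:
    """Simulates the TEE side. ASSUMPTION: HMAC replaces the real
    hardware-rooted asymmetric signature so the sketch needs only stdlib."""
    measurement = sha256_hex(guardrail_code)
    report_data = sha256_hex(response)
    sig = hmac.new(key, _payload(measurement, report_data), hashlib.sha256).hexdigest()
    return Attestation(measurement, report_data, sig)


def verify(att: Attestation, key: bytes, published_guardrail: bytes, response: bytes) -> bool:
    """User-side check: valid signature, expected guardrail, matching response."""
    expected_sig = hmac.new(
        key, _payload(att.measurement, att.report_data), hashlib.sha256
    ).hexdigest()
    return (
        hmac.compare_digest(att.signature, expected_sig)
        and hmac.compare_digest(att.measurement, sha256_hex(published_guardrail))
        and hmac.compare_digest(att.report_data, sha256_hex(response))
    )


if __name__ == "__main__":
    key = b"demo-key"                       # hypothetical stand-in key
    guardrail = b"def guard(text): ..."     # placeholder for the open-source guardrail
    reply = b"Here is a safe answer."
    att = toy_tee_sign(key, guardrail, reply)
    print(verify(att, key, guardrail, reply))          # attested response
    print(verify(att, key, guardrail, b"tampered"))    # response swapped afterwards
```

As the abstract notes, a passing check only establishes that the guardrail code executed. It says nothing about whether the developer fed the guardrail inputs crafted to jailbreak it.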
