CRAI Feb 13, 2025

AgentGuard: Repurposing Agentic Orchestrator for Safety Evaluation of Tool Orchestration

arXiv:2502.09809v1, 19 citations, h-index: 1
Originality: Highly original
AI Analysis

This work addresses the problem of safety evaluation for LLM agents with tool-use capability, which is crucial for their trustworthiness in real-world applications.

The authors address the safety of large language models (LLMs) integrated with tool use and propose AgentGuard, a framework that discovers and validates unsafe tool-use workflows and then generates safety constraints that confine agent behavior, providing a baseline safety guarantee at deployment. They demonstrate the framework's feasibility through experiments.

The integration of tool use into large language models (LLMs) enables agentic systems with real-world impact. At the same time, unlike standalone LLMs, compromised agents can execute malicious workflows whose impact is amplified by their tool-use capability. We propose AgentGuard, a framework to autonomously discover and validate unsafe tool-use workflows, followed by generating safety constraints to confine the behaviors of agents, achieving a baseline safety guarantee at deployment. AgentGuard leverages the LLM orchestrator's innate capabilities - knowledge of tool functionalities, scalable and realistic workflow generation, and tool execution privileges - to act as its own safety evaluator. The framework operates through four phases: identifying unsafe workflows, validating them in real-world execution, generating safety constraints, and validating constraint efficacy. The output, an evaluation report with unsafe workflows, test cases, and validated constraints, enables multiple security applications. We empirically demonstrate AgentGuard's feasibility with experiments. With this exploratory work, we hope to inspire the establishment of standardized testing and hardening procedures for LLM agents to enhance their trustworthiness in real-world applications.
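The four phases described in the abstract can be pictured as a small control loop. The sketch below is only an illustration of that flow under stated assumptions, not the authors' implementation: the names `Workflow`, `Report`, `evaluate`, and the three callbacks are hypothetical, the "deny:" rule format is invented for the demo, and the stub functions stand in for the LLM orchestrator and the real tool execution environment. Test cases, which the paper's report also contains, are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Workflow:
    """A candidate sequence of tool calls (hypothetical representation)."""
    steps: List[str]


@dataclass
class Report:
    """Evaluation output: confirmed unsafe workflows and constraints that blocked them."""
    unsafe_workflows: List[Workflow] = field(default_factory=list)
    validated_constraints: List[str] = field(default_factory=list)


def evaluate(
    propose: Callable[[], List[Workflow]],
    execute: Callable[[Workflow, List[str]], bool],
    constrain: Callable[[Workflow], str],
) -> Report:
    """Run the four phases over every proposed workflow.

    execute(wf, constraints) returns True if the workflow still completes
    while the given constraints are active.
    """
    report = Report()
    for wf in propose():                      # phase 1: identify unsafe workflows
        if not execute(wf, []):               # phase 2: validate by execution
            continue                          # not reproducible, so not reported
        report.unsafe_workflows.append(wf)
        rule = constrain(wf)                  # phase 3: generate a safety constraint
        if not execute(wf, [rule]):           # phase 4: constraint must block the workflow
            report.validated_constraints.append(rule)
    return report


# Stubs standing in for the orchestrator and the sandbox (assumptions for the demo).
def demo_propose() -> List[Workflow]:
    return [Workflow(["read_file", "send_http"]), Workflow(["noop"])]


def demo_constrain(wf: Workflow) -> str:
    return "deny:" + "->".join(wf.steps)


def demo_execute(wf: Workflow, constraints: List[str]) -> bool:
    if wf.steps == ["noop"]:
        return False                          # pretend this workflow cannot be reproduced
    return demo_constrain(wf) not in constraints


if __name__ == "__main__":
    result = evaluate(demo_propose, demo_execute, demo_constrain)
    print(len(result.unsafe_workflows), result.validated_constraints)
```

Passing the phases in as callbacks keeps the loop independent of any particular model or tool runtime, which is one plausible way to reuse the same orchestrator for both proposing and executing workflows.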

