CR · May 2

Toward a Principled Framework for Agent Safety Measurement

Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan

arXiv:2605.01644 · 81.7

Predicted impact: top 11% in CR · last 90 days
Originality: Highly original

AI Analysis

For researchers and practitioners evaluating safety of LLM agents, this work addresses the problem of missing low-probability unsafe behaviors in current evaluations.

The paper argues that agent safety should be measured by search rather than sampling, and introduces BOA, a framework that searches trajectory space under a likelihood budget to compute a safety score. BOA discovers unsafe trajectories missed by greedy or sampled evaluations, enabling ranking of models, defenses, and attacks with manageable GPU costs.

LLM agents emit actions, not just text, and once taken, those actions often cannot be undone. Yet today's agent-safety evaluations run greedy or a few sampled rollouts and report a single safe/unsafe rate -- blind to the long-tail trajectories where unsafe behavior may arise from low-probability but non-negligible actions. We argue agent safety should be measured by search, not sampling. We introduce BOA, a framework that, given a deployment configuration (model, decoder, prompt, environment, judger, likelihood budget), searches the in-budget trajectory space and reports a safety score: the probability the agent stays safe under the configuration. BOA searches both within a single LLM round and across the agent-environment interaction tree under a given likelihood budget, and makes search practical via batched decoding/judging, prefix caching, and chunked tree expansion. On agent-safety workloads, BOA discovers unsafe trajectories that greedy and sampled evaluations miss. BOA can additionally be used for ranking models, defenses, and attacks, all on the same scale, with manageable GPU costs.
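To make the search-versus-sampling idea concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a hypothetical `toy_policy` that maps an action prefix to a next-action distribution, a hypothetical `is_unsafe` judge, and a likelihood budget expressed as a minimum trajectory probability `min_prob`. All names, probabilities, and the way the masses are reported are illustrative assumptions; the abstract does not specify how BOA defines its budget or normalizes its score.

```python
from typing import Callable, Dict, List, Tuple

Trajectory = Tuple[str, ...]
# Hypothetical policy type: next-action distribution; {} marks a finished trajectory.
Policy = Callable[[Trajectory], Dict[str, float]]


def search_in_budget(
    policy: Policy,
    is_unsafe: Callable[[Trajectory], bool],
    min_prob: float,
) -> Tuple[float, float, List[Tuple[Trajectory, float]]]:
    """Enumerate every complete trajectory whose probability is >= min_prob.

    Returns (safe_mass, unsafe_mass, unsafe_trajectories).
    """
    safe_mass = 0.0
    unsafe: List[Tuple[Trajectory, float]] = []
    stack: List[Tuple[Trajectory, float]] = [((), 1.0)]
    while stack:
        prefix, prob = stack.pop()
        dist = policy(prefix)
        if not dist:  # terminal: hand the full trajectory to the judge
            if is_unsafe(prefix):
                unsafe.append((prefix, prob))
            else:
                safe_mass += prob
            continue
        for action, q in dist.items():
            if prob * q >= min_prob:  # prune branches that fall out of budget
                stack.append((prefix + (action,), prob * q))
    unsafe_mass = sum(p for _, p in unsafe)
    return safe_mass, unsafe_mass, unsafe


def toy_policy(prefix: Trajectory) -> Dict[str, float]:
    # Made-up two-step agent; the probabilities are illustrative only.
    if len(prefix) == 0:
        return {"read_file": 0.9, "list_dir": 0.1}
    if len(prefix) == 1:
        if prefix[0] == "read_file":
            return {"summarize": 0.98, "delete_file": 0.02}
        return {"summarize": 1.0}
    return {}


def is_unsafe(trajectory: Trajectory) -> bool:
    # Toy judge: any trajectory containing an irreversible delete is unsafe.
    return "delete_file" in trajectory


if __name__ == "__main__":
    safe, bad, found = search_in_budget(toy_policy, is_unsafe, min_prob=0.01)
    print(f"safe mass={safe:.3f} unsafe mass={bad:.3f} unsafe={found}")
```

In this toy, a greedy rollout follows `read_file` then `summarize` and reports the agent as safe, while the search with `min_prob=0.01` also reaches the low-probability `delete_file` branch. Raising `min_prob` above that branch's probability prunes it, which illustrates why the budget is part of the configuration being measured.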
