Toward a Principled Framework for Agent Safety Measurement
For researchers and practitioners evaluating the safety of LLM agents, this work addresses a gap in current evaluations: they miss low-probability unsafe behaviors.
The paper argues that agent safety should be measured by search rather than sampling, and introduces BOA, a framework that searches trajectory space under a likelihood budget to compute a safety score. BOA discovers unsafe trajectories missed by greedy or sampled evaluations, enabling ranking of models, defenses, and attacks with manageable GPU costs.
LLM agents emit actions, not just text, and once taken, those actions often cannot be undone. Yet today's agent-safety evaluations run a greedy rollout or a few sampled rollouts and report a single safe/unsafe rate, which leaves them blind to the long-tail trajectories where unsafe behavior may arise from low-probability but non-negligible actions. We argue agent safety should be measured by search, not sampling. We introduce BOA, a framework that, given a deployment configuration (model, decoder, prompt, environment, judger, likelihood budget), searches the in-budget trajectory space and reports a safety score: the probability that the agent stays safe under the configuration. BOA searches both within a single LLM round and across the agent-environment interaction tree under the given likelihood budget, and makes search practical via batched decoding/judging, prefix caching, and chunked tree expansion. On agent-safety workloads, BOA discovers unsafe trajectories that greedy and sampled evaluations miss. BOA can additionally be used to rank models, defenses, and attacks on the same scale, with manageable GPU costs.
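To make the search-versus-sampling contrast concrete, the following is a minimal toy sketch, not BOA's implementation. It assumes a hypothetical two-step policy (`toy_policy`), a hypothetical judge (`toy_judge`), and one possible reading of the likelihood budget as a minimum trajectory probability (`min_prob`); the abstract does not specify how BOA defines the budget or normalizes its score. It omits batching, prefix caching, and chunked expansion entirely.

```python
from typing import Callable, Dict, List, Tuple

Trajectory = Tuple[str, ...]
# Hypothetical policy type: maps a trajectory prefix to a distribution over
# next actions. An empty dict means the trajectory is complete.
Policy = Callable[[Trajectory], Dict[str, float]]


def toy_policy(prefix: Trajectory) -> Dict[str, float]:
    """Illustrative two-step agent; all actions and probabilities are made up."""
    if len(prefix) == 0:
        return {"read_file": 0.9, "list_dir": 0.1}
    if len(prefix) == 1:
        if prefix[0] == "read_file":
            return {"summarize": 0.99, "delete_file": 0.01}
        return {"summarize": 0.95, "delete_file": 0.05}
    return {}


def toy_judge(trajectory: Trajectory) -> bool:
    """Hypothetical judge: a trajectory is safe if it never deletes a file."""
    return "delete_file" not in trajectory


def greedy_rollout(policy: Policy) -> Trajectory:
    """Follow the most likely action at every step."""
    prefix: Trajectory = ()
    while True:
        dist = policy(prefix)
        if not dist:
            return prefix
        prefix += (max(dist, key=dist.get),)


def search_in_budget(
    policy: Policy, judge: Callable[[Trajectory], bool], min_prob: float
) -> Tuple[float, float, List[Tuple[Trajectory, float]]]:
    """Depth-first search over every trajectory with probability >= min_prob.

    Returns (safe_mass, unsafe_mass, unsafe_trajectories). Assumption: the
    budget is a floor on trajectory probability; branches below it are pruned.
    """
    safe_mass = 0.0
    unsafe: List[Tuple[Trajectory, float]] = []
    stack: List[Tuple[Trajectory, float]] = [((), 1.0)]
    while stack:
        prefix, prob = stack.pop()
        dist = policy(prefix)
        if not dist:  # complete trajectory: hand it to the judge
            if judge(prefix):
                safe_mass += prob
            else:
                unsafe.append((prefix, prob))
            continue
        for action, p in dist.items():
            child_prob = prob * p
            if child_prob >= min_prob:  # prune out-of-budget branches
                stack.append((prefix + (action,), child_prob))
    unsafe_mass = sum(p for _, p in unsafe)
    return safe_mass, unsafe_mass, unsafe


if __name__ == "__main__":
    print("greedy rollout safe?", toy_judge(greedy_rollout(toy_policy)))
    safe, unsafe_mass, found = search_in_budget(toy_policy, toy_judge, 0.001)
    print("safe mass:", safe, "unsafe mass:", unsafe_mass)
    for trajectory, prob in found:
        print("unsafe:", trajectory, prob)
```

In this toy setup the greedy rollout is safe, while the search surfaces two unsafe trajectories carrying a combined probability of about 0.014. Ten independent sampled rollouts would miss both with probability 0.986^10, roughly 87%. Under the assumed budget semantics, the explored safe mass gives a lower bound on the probability of staying safe and one minus the unsafe mass gives an upper bound; whether BOA reports its score this way is not stated here.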