AI · May 5

Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios

arXiv:2605.03242

Predicted impact: top 72% in AI · last 90 days

Originality: Incremental advance

AI Analysis

For developers and evaluators of LLM-based agent systems, this work provides practical tools to stress-test and improve safety judgment under deceptive distribution shifts, though ARISE is not a standalone safety guarantee.

The paper introduces ROME, a pipeline that rewrites unsafe agent trajectories into more deceptive evaluation instances, and ARISE, a retrieval-guided inference-time enhancement that improves safety judgment without retraining. Experiments show that ROME-generated challenges substantially degrade safety-judgment performance, with hidden-risk cases remaining difficult even for frontier models, while ARISE provides task-specific robustness improvements.

Tool-using agent systems powered by large language models (LLMs) are increasingly deployed across web, app, operating-system, and transactional environments. Yet existing safety benchmarks still emphasize explicit risks, potentially overstating a model's ability to judge deceptive or ambiguous trajectories. To address this gap, we introduce ROME (Red-team Orchestrated Multi-agent Evolution), a controlled benchmark-construction pipeline that rewrites known unsafe trajectories into more deceptive evaluation instances while preserving their underlying risk labels. Starting from 100 unsafe source trajectories, ROME produces 300 challenge instances spanning contextual ambiguity, implicit risks, and shortcut decision-making. Experiments show that these challenge sets substantially degrade safety-judgment performance, with hidden-risk cases remaining particularly non-trivial even for recent frontier models. We further study ARISE (Analogical Reasoning for Inference-time Safety Enhancement), a retrieval-guided inference-time enhancement that retrieves ReAct-style analogical safety trajectories from an external analogical base and injects them as structured reasoning exemplars. ARISE improves judgment quality without retraining, but is best viewed as a task-specific robustness enhancement rather than a standalone safety guarantee. Together, ROME and ARISE provide practical tools for stress-testing and improving agent safety judgment under deceptive distribution shifts.
