Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety
The paper addresses safety risks for LLM agents in high-stakes applications, though the contribution is incremental because it builds on existing uncertainty quantification methods.
The paper tackles the safety of LLM agents operating in complex environments by proposing selective quitting as a behavioral mechanism. Agents explicitly prompted to quit improve safety by an average of 0.39 points on a 0-3 scale, while helpfulness decreases by only 0.03 points on average.
As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn agentic scenarios with real-world tool access present unique challenges where uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text generation failures. We propose using "quitting" as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state-of-the-art LLMs. Our results demonstrate a highly favorable safety-helpfulness trade-off: agents prompted to quit with explicit instructions improve safety by an average of +0.39 on a 0-3 scale across all models (+0.64 for proprietary models), while maintaining a negligible average decrease of -0.03 in helpfulness. Our analysis demonstrates that simply adding explicit quit instructions proves to be a highly effective safety mechanism that can immediately be deployed in existing agent systems, and establishes quitting as an effective first-line defense mechanism for autonomous agents in high-stakes applications.
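To make the mechanism described in the abstract concrete, the following is a minimal sketch of how an explicit quit instruction might be added to an existing agent loop. It is an illustration under stated assumptions, not the paper's implementation: the wording of `QUIT_INSTRUCTION`, the `QUIT:` output convention, and the `model` and `execute_tool` callables are all hypothetical, and nothing here uses the ToolEmu API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical wording; the paper's actual quit prompt is not reproduced here.
QUIT_INSTRUCTION = (
    "If you are not confident that an action is safe, or the task is ambiguous, "
    "stop and reply with a line starting with 'QUIT:' followed by your reason."
)
QUIT_PREFIX = "QUIT:"  # assumed output convention for signalling a quit


@dataclass
class StepResult:
    quit: bool  # True when the model chose to withdraw
    text: str   # quit reason, or the action to execute


def with_quit_instruction(system_prompt: str) -> str:
    """Append the explicit quit instruction to an existing system prompt."""
    return f"{system_prompt.rstrip()}\n\n{QUIT_INSTRUCTION}"


def parse_step(model_output: str) -> StepResult:
    """Classify one model output as either a quit decision or a tool action."""
    stripped = model_output.strip()
    if stripped.upper().startswith(QUIT_PREFIX):
        return StepResult(True, stripped[len(QUIT_PREFIX):].strip())
    return StepResult(False, stripped)


def run_agent(
    model: Callable[[str, List[str]], str],  # hypothetical: (prompt, history) -> output
    execute_tool: Callable[[str], str],      # hypothetical: action -> observation
    system_prompt: str,
    task: str,
    max_steps: int = 5,
) -> List[str]:
    """Run a multi-turn loop that stops before executing anything once the model quits."""
    prompt = with_quit_instruction(system_prompt)
    history = [f"USER: {task}"]
    for _ in range(max_steps):
        step = parse_step(model(prompt, history))
        if step.quit:
            # No tool call is made on a quit; the reason is surfaced to the user.
            history.append(f"AGENT QUIT: {step.text}")
            break
        observation = execute_tool(step.text)
        history.append(f"ACTION: {step.text}")
        history.append(f"OBSERVATION: {observation}")
    return history
```

The design point this sketch tries to capture is that quitting is checked before any tool execution at every turn, so a low-confidence step ends the episode instead of compounding uncertainty across later turns.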