CL, Apr 18

On Safety Risks in Experience-Driven Self-Evolving Agents

CMU
arXiv:2604.1696890.73 citations, h-index: 33
AI Analysis

For developers of autonomous LLM agents, this work exposes inherent safety limitations in self-evolution paradigms, highlighting the need for principled adaptation strategies.

This paper investigates safety risks in experience-driven self-evolving agents. It finds that experience gathered only from benign tasks can degrade safety in high-risk scenarios by reinforcing action over refusal. When agents encounter both benign and harmful tasks, refusal-related experience mitigates the safety decline but induces over-refusal, revealing a safety-utility trade-off.

Experience-driven self-evolution has emerged as a promising paradigm for improving the autonomy of large language model agents, yet its reliance on self-curated experience introduces underexplored safety risks. In this study, we investigate how experience accumulation and utilization in self-evolving agents affect safety performance across web-based and embodied environments. Notably, experience gathered solely from benign tasks can still compromise safety in high-risk scenarios. Further analysis attributes this degradation to the execution-oriented nature of accumulated experience, which reinforces agents' tendency to act rather than refuse. In more realistic settings where agents encounter both benign and harmful tasks, refusal-related experience mitigates safety decline but induces over-refusal, revealing a fundamental safety-utility trade-off. Overall, our findings expose inherent limitations of current self-evolving agents and call for more principled strategies to ensure safe and reliable adaptation.
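
The two sides of the trade-off described above can be made concrete with a pair of rates: how often the agent acts on harmful tasks, and how often it refuses benign ones. The sketch below is illustrative only. The `Episode` record, its fields, and the `safety_utility` function are hypothetical names, and the metrics are an assumption for illustration rather than the paper's own evaluation protocol.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Episode:
    """One evaluated task (hypothetical record, not from the paper)."""
    harmful: bool  # assumed ground-truth label: is the task high-risk?
    refused: bool  # did the agent decline to act?


def safety_utility(episodes: Iterable[Episode]) -> Tuple[float, float]:
    """Return (unsafe_action_rate, over_refusal_rate).

    unsafe_action_rate: share of harmful tasks the agent acted on.
    over_refusal_rate:  share of benign tasks the agent refused.
    Either rate is 0.0 when its task group is empty.
    """
    episodes = list(episodes)
    harmful = [e for e in episodes if e.harmful]
    benign = [e for e in episodes if not e.harmful]

    unsafe = sum(not e.refused for e in harmful) / len(harmful) if harmful else 0.0
    over_refusal = sum(e.refused for e in benign) / len(benign) if benign else 0.0
    return unsafe, over_refusal
```

Under this framing, experience that pushes the agent toward acting would raise the first rate, while refusal-related experience would lower it at the cost of raising the second.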
