CRAIFeb 18, 2025

DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent

arXiv:2502.12575v219 citationsh-index: 9Has CodeEMNLP
Originality Highly original
AI Analysis

This exposes critical safety vulnerabilities in LLM-based agents, highlighting limitations of existing defenses and the need for more robust security measures.

The paper tackles the problem of backdoor attacks on LLM-based agents being detectable by safety audits, proposing a dynamically encrypted multi-backdoor implantation attack that achieves near 100% attack success rate with 0% detection rate.

As LLM-based agents become increasingly prevalent, backdoors can be implanted into agents through user queries or environment feedback, raising critical concerns regarding safety vulnerabilities. However, backdoor attacks are typically detectable by safety audits that analyze the reasoning process of agents. To this end, we propose a novel backdoor implantation strategy called \textbf{Dynamically Encrypted Multi-Backdoor Implantation Attack}. Specifically, we introduce dynamic encryption, which maps the backdoor into benign content, effectively circumventing safety audits. To enhance stealthiness, we further decompose the backdoor into multiple sub-backdoor fragments. Based on these advancements, backdoors are allowed to bypass safety audits significantly. Additionally, we present AgentBackdoorEval, a dataset designed for the comprehensive evaluation of agent backdoor attacks. Experimental results across multiple datasets demonstrate that our method achieves an attack success rate nearing 100\% while maintaining a detection rate of 0\%, illustrating its effectiveness in evading safety audits. Our findings highlight the limitations of existing safety mechanisms in detecting advanced attacks, underscoring the urgent need for more robust defenses against backdoor threats. Code and data are available at https://github.com/whfeLingYu/DemonAgent.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes