CR | CL | Oct 22, 2024

AdvAgent: Controllable Blackbox Red-teaming on Web Agents

arXiv:2410.17401v4 | 39 citations | h-index: 15 | ICML
Originality: Highly original
AI Analysis

The paper addresses vulnerabilities in web agents that could lead to severe consequences, and offers a novel method for a known bottleneck in security.

The paper tackles the security risks of foundation model-based web agents by proposing AdvAgent, a black-box red-teaming framework that uses reinforcement learning to generate adversarial prompts, achieving high success rates against GPT-4-based agents and showing that existing defenses are insufficient.

Foundation model-based agents are increasingly used to automate complex tasks, enhancing efficiency and productivity. However, their access to sensitive resources and autonomous decision-making also introduce significant security risks, where successful attacks could lead to severe consequences. To systematically uncover these vulnerabilities, we propose AdvAgent, a black-box red-teaming framework for attacking web agents. Unlike existing approaches, AdvAgent employs a reinforcement learning-based pipeline to train an adversarial prompter model that optimizes adversarial prompts using feedback from the black-box agent. With careful attack design, these prompts effectively exploit agent weaknesses while maintaining stealthiness and controllability. Extensive evaluations demonstrate that AdvAgent achieves high success rates against state-of-the-art GPT-4-based web agents across diverse web tasks. Furthermore, we find that existing prompt-based defenses provide only limited protection, leaving agents vulnerable to our framework. These findings highlight critical vulnerabilities in current web agents and emphasize the urgent need for stronger defense mechanisms. We release code at https://ai-secure.github.io/AdvAgent/.
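To make the idea of optimizing from black-box feedback concrete, here is a minimal toy sketch. It is not the paper's method: the abstract does not specify the training algorithm, so a plain REINFORCE update over a small categorical policy stands in for the adversarial prompter model. The names `train_prompter`, `stub_agent_feedback`, and the placeholder `candidate_*` strings are hypothetical, and the stub simply returns a scalar reward in place of a real agent.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_prompter(candidates, black_box_feedback, steps=500, lr=0.5, seed=0):
    """Toy REINFORCE loop (an assumption, not the paper's algorithm).

    The policy is a categorical distribution over fixed candidates.
    Only a scalar reward from the black box is observed; no gradients
    or internals of the target are used.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    baseline = 0.0  # running average of rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        idx = rng.choices(range(len(candidates)), weights=probs, k=1)[0]
        reward = black_box_feedback(candidates[idx])
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # Gradient of log pi(idx) w.r.t. the logits is onehot(idx) - probs.
        for j in range(len(logits)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)


def stub_agent_feedback(candidate):
    """Hypothetical stand-in for the black-box agent's success signal."""
    return 1.0 if candidate == "candidate_2" else 0.0


candidates = ["candidate_0", "candidate_1", "candidate_2", "candidate_3"]
final_probs = train_prompter(candidates, stub_agent_feedback)
best = candidates[max(range(len(candidates)), key=final_probs.__getitem__)]
print(best, [round(p, 3) for p in final_probs])
```

The point of the sketch is only the information flow: the optimizer sees a success signal per query and nothing else, which is what makes the setting black-box. A real prompter would be a generative model rather than a fixed candidate list.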