David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning

arXiv:2602.02395v1 | 2.71 citations | h-index: 9 | Has Code

Originality: Highly original

AI Analysis

Aimed at developers and safety researchers, this work addresses security vulnerabilities in autonomous AI agents and establishes a new verifiable threat model with broad implications for agent safety.

The paper addresses adversarial failures in autonomous language model agents by formalizing Tag-Along Attacks, in which a tool-less adversary exploits the trusted privileges of a safety-aligned Operator to induce prohibited tool use. It presents Slingshot, a reinforcement learning framework that achieves a 67.0% success rate against a Qwen2.5-32B-Instruct-AWQ Operator on held-out extreme-difficulty tasks (vs. a 1.7% baseline), reducing the expected attempts to first success on solved tasks from 52.3 to 1.3.

The evolution of large language models into autonomous agents introduces adversarial failures that exploit legitimate tool privileges, transforming safety evaluation in tool-augmented environments from a subjective NLP task into an objective control problem. We formalize this threat model as Tag-Along Attacks: a scenario where a tool-less adversary "tags along" on the trusted privileges of a safety-aligned Operator to induce prohibited tool use through conversation alone. To validate this threat, we present Slingshot, a 'cold-start' reinforcement learning framework that autonomously discovers emergent attack vectors, revealing a critical insight: in our setting, learned attacks tend to converge to short, instruction-like syntactic patterns rather than multi-turn persuasion. On held-out extreme-difficulty tasks, Slingshot achieves a 67.0% success rate against a Qwen2.5-32B-Instruct-AWQ Operator (vs. 1.7% baseline), reducing the expected attempts to first success (on solved tasks) from 52.3 to 1.3. Crucially, Slingshot transfers zero-shot to several model families, including closed-source models like Gemini 2.5 Flash (56.0% attack success rate) and defensive-fine-tuned open-source models like Meta-SecAlign-8B (39.2% attack success rate). Our work establishes Tag-Along Attacks as a first-class, verifiable threat model and shows that effective agentic attacks can be elicited from off-the-shelf open-weight models through environment interaction alone.
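The "expected attempts to first success (on solved tasks)" figure can be read as a per-task quantity averaged over tasks that were solved at least once. The abstract does not spell out the estimator, so the following is only a minimal sketch under stated assumptions: attempts on a task are treated as independent with a fixed success probability (a geometric model, so the expectation is 1/p), and the function name, input layout, and toy data are all hypothetical. Under this reading the aggregate would not need to equal the reciprocal of the overall success rate (1/0.67 ≈ 1.49), since unsolved tasks are excluded and the average is taken task by task.

```python
from statistics import mean


def expected_attempts_to_first_success(per_task_outcomes):
    """Average expected attempts to first success over solved tasks.

    ASSUMPTION (not from the paper): each task maps to a list of booleans,
    one per independent attack attempt, and attempts follow a geometric
    model, so the expected number of attempts on a task is 1 / p.
    """
    expectations = []
    for outcomes in per_task_outcomes.values():
        if not outcomes:
            continue  # no attempts recorded for this task
        p = sum(outcomes) / len(outcomes)  # empirical per-task success rate
        if p > 0:  # keep only "solved" tasks
            expectations.append(1.0 / p)
    if not expectations:
        return float("inf")  # nothing was ever solved
    return mean(expectations)


# Hypothetical toy data, for illustration only.
toy = {
    "task_a": [True, True, True, True],      # p = 1.0, expects 1 attempt
    "task_b": [True, False, True, False],    # p = 0.5, expects 2 attempts
    "task_c": [False, False, False, False],  # never solved, excluded
}
result = expected_attempts_to_first_success(toy)  # (1 + 2) / 2 = 1.5
```

In this toy case the overall success rate is 6/12 = 0.5, whose reciprocal is 2.0, while the solved-task average is 1.5, which illustrates how the two quantities can differ.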
