CR AI · May 29

Stateful Online Monitoring Catches Distributed Agent Attacks

Davis Brown, Samarth Bhargav, Arav Santhanam, Kasper Hong, Ivan Zhang, Matan Shtepel, Steffi Chern, Alexander Robey, Eric Wong, Hamed Hassani

arXiv:2605.31593 · 95.2

Predicted impact: top 4% in CR · last 90 days
Originality: Highly original

AI Analysis

This work addresses a critical blind spot in current safety monitors for large language model agents: they cannot detect distributed cyberattacks that are visible only in aggregate. This gap poses a significant cybersecurity risk for organizations deploying such agents.

This paper demonstrates what the authors describe as the first distributed agent attack, which evades standard safety monitors by splitting harmful tasks across multiple user accounts; a standard monitor catches it only a fifth as often as prior agent attacks (an 80% reduction in detection rate). To counter this, the authors develop a stateful online monitor that uses real-time clustering to aggregate weak suspiciousness signals across transcripts, catching distributed attacks 30% earlier with negligible added latency for about 99% of traffic.

Language models can find thousands of severe software vulnerabilities, and agents are increasingly being misused for cyberattacks. To avoid detection, attackers frequently distribute their misuse, splitting a harmful task across many user accounts so each individual transcript looks benign. Because safety monitors score only one agent context at a time, they are structurally blind to misuse that is only visible in aggregate, across many accounts. We show this gap is real by building, to our knowledge, the first distributed agent attack, a multi-agent scaffold that completes hard cybersecurity tasks while hiding the harmful objective across subagents with limited contexts, evading a standard monitor that catches it only a fifth as often as prior agent attacks. Towards a defense, we develop an online stateful monitor that uses real-time clustering to collect weak suspiciousness signals across many agent transcripts, and escalates only rarely to a language model that flags misuse across user accounts. In evaluations with large-scale simulated datacenter traffic, our monitor Pareto dominates standard monitors, catching distributed attacks 30% earlier and flagging cyber misuse before it reaches the most harmful stages. Crucially, this comes at negligible additional latency for ~99% of user traffic. This detection advantage persists but narrows as the benign background traffic grows very large. After an extensive red-teaming exercise, we improve the defense and surprisingly also find that it catches standard jailbreaks, since adaptive attackers reuse attack variants across accounts. Our results point toward a new class of safety monitors which reason over groups of users rather than isolated transcripts.
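The aggregation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm, the features, or the escalation rule, so everything below is an assumption made for illustration. The sketch assumes each transcript arrives with a precomputed embedding and a weak per-transcript suspiciousness score in [0, 1], groups transcripts with a simple greedy cosine-similarity clustering, and escalates a cluster once its summed score and its number of distinct accounts both cross thresholds. The names `StatefulMonitor`, `observe`, `sim_threshold`, `score_budget`, and `min_accounts` are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Cluster:
    centroid: list                      # running mean of member embeddings
    count: int = 0                      # number of transcripts in the cluster
    total_score: float = 0.0            # summed weak suspiciousness scores
    accounts: set = field(default_factory=set)  # distinct accounts seen


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class StatefulMonitor:
    """Hypothetical sketch: aggregate weak signals across accounts online."""

    def __init__(self, sim_threshold=0.9, score_budget=1.0, min_accounts=3):
        self.sim_threshold = sim_threshold  # assumed: min similarity to join a cluster
        self.score_budget = score_budget    # assumed: summed score that triggers escalation
        self.min_accounts = min_accounts    # assumed: distinct accounts required
        self.clusters = []

    def observe(self, account, embedding, score):
        """Record one transcript; return True if its cluster should be escalated."""
        # Greedy online clustering: join the most similar cluster, else open a new one.
        best = max(self.clusters,
                   key=lambda c: cosine(c.centroid, embedding),
                   default=None)
        if best is None or cosine(best.centroid, embedding) < self.sim_threshold:
            best = Cluster(centroid=list(embedding))
            self.clusters.append(best)

        # Update the cluster state incrementally (running-mean centroid).
        best.count += 1
        best.centroid = [c + (e - c) / best.count
                         for c, e in zip(best.centroid, embedding)]
        best.total_score += score
        best.accounts.add(account)

        # Escalate only when weak signals add up across several accounts.
        return (best.total_score >= self.score_budget
                and len(best.accounts) >= self.min_accounts)
```

In this sketch, a transcript scoring 0.3 would pass a per-transcript threshold of, say, 0.5, but several similar transcripts from different accounts accumulate in one cluster until `observe` returns True. In a real system that return value would gate the expensive step, for example a call to a language model that reviews the whole cluster, so that most traffic never incurs it.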
