CR · AI · May 10

FragBench: Cross-Session Attacks Hidden in Benign-Looking Fragments

Astha Mehta, Niruthiha Selvanayagam, Cedric Lam, Hengxu Li, Phuc-Nguyen Nguyen, Raymond Lee, Olivia McGoffin, My Luong, Arthur Collé, Jamie Johnson, David Williams-King

arXiv:2605.11029 · 78.1 · Has Code

Predicted impact: top 13% in CR · last 90 days
Originality: Highly original

AI Analysis

For LLM safety researchers, this work highlights a new attack vector (cross-session fragmentation) that existing benchmarks miss, and provides a benchmark and baseline detectors.

FragBench introduces a benchmark for cross-session LLM attacks where malicious goals are split into benign-looking fragments across separate sessions. Graph-based detectors achieve event-level F1 scores of 0.88-0.96, showing that defense requires modeling cross-session interaction graphs.

An attacker can split a malicious goal into sub-prompts that each look benign on their own and only become harmful in combination. Existing LLM safety benchmarks evaluate prompts one at a time, or across turns of a single chat, and so do not look for a malicious signal spread across separate sessions with no shared context. We build FragBench, a benchmark drawn from 24 real-world cyber-incident campaigns, which keeps the full attack trail: the multi-fragment kill chain, the per-fragment safety-judge verdicts, sandboxed execution traces, and a matched set of benign cover sessions. FragBench splits this trail into two paired tasks: an adversarial rewriter that hardens fragments against a single-turn safety judge (FragBench Attack), and a graph-based user-level detector trained on the resulting interactions (FragBench Defense). The single-turn judge is near chance on the released corpus by construction, but four GNN variants and three classical-ML baselines all recover the cross-session feature, reaching aggregate event-level F1 = 0.88-0.96. Defending against fragmented LLM misuse therefore requires modeling the cross-session interaction graph, rather than isolated prompts. Our generator, rewriter, sandbox harness, and detector are released at https://github.com/LidaSafety/fragbench.
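To make the cross-session idea concrete, here is a minimal sketch of why pooling one user's sessions exposes a signal that no single session carries. It is not the FragBench detector: the stage labels, the event tuples, and every function name are hypothetical, and a real fragment would not arrive with a stage label attached.

```python
from collections import defaultdict

# Hypothetical kill-chain stages; illustrative labels, not taken from FragBench.
STAGES = ("recon", "payload", "delivery", "exfiltration")

def build_user_graph(events):
    """Group (user, session, stage) events into a user -> session -> stages mapping."""
    graph = defaultdict(lambda: defaultdict(set))
    for user, session, stage in events:
        graph[user][session].add(stage)
    return graph

def max_single_session_coverage(sessions):
    """Largest fraction of stages visible inside any one session."""
    return max(len(stages) for stages in sessions.values()) / len(STAGES)

def cross_session_coverage(sessions):
    """Fraction of stages covered when all of a user's sessions are pooled."""
    pooled = set().union(*sessions.values())
    return len(pooled) / len(STAGES)

# Toy data (assumed): one user spreads four stages over four sessions,
# another repeats a single benign-looking stage.
events = [
    ("alice", "s1", "recon"),
    ("alice", "s2", "payload"),
    ("alice", "s3", "delivery"),
    ("alice", "s4", "exfiltration"),
    ("bob", "s5", "recon"),
    ("bob", "s6", "recon"),
]

graph = build_user_graph(events)
for user, sessions in graph.items():
    print(user, max_single_session_coverage(sessions), cross_session_coverage(sessions))
```

In this toy data both users look identical to a per-session view (coverage 0.25), while the pooled view separates them (1.0 versus 0.25). A learned detector over the interaction graph would replace these hand-written coverage scores.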
