Benchmarking Misuse Mitigation Against Covert Adversaries
For AI safety researchers, this work highlights a practical attack vector and provides benchmarks to evaluate defenses against covert misuse.
The authors identify that existing safety evaluations overlook covert attacks where an attacker decomposes a dangerous task into many benign-seeming queries. They develop BSD, a data generation pipeline for evaluating such attacks, and find that decomposition attacks are effective at bypassing frontier model safeguards, while stateful defenses show promise as a countermeasure.
Existing language model safety evaluations focus on overt attacks and low-stakes tasks. In reality, an attacker can easily subvert existing safeguards by requesting help on small, benign-seeming tasks across many independent queries. Because the individual queries do not appear harmful, the attack is hard to detect. When combined, however, these fragments provide misuse uplift by helping the attacker complete hard and dangerous tasks. Toward identifying defenses against such strategies, we develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Using this pipeline, we curate two new datasets whose questions are consistently refused by frontier models and are too difficult for weaker open-weight models. With these datasets, we evaluate decomposition attacks, find that they are effective misuse enablers, and highlight stateful defenses as a promising countermeasure.
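To make the contrast between per-query and stateful checking concrete, the following is a minimal toy sketch and not the BSD pipeline or any defense evaluated in this work. It assumes a hypothetical upstream classifier that assigns each query a risk score in [0, 1]; the class names, thresholds, window size, and scores are all illustrative assumptions.

```python
from collections import defaultdict, deque


class StatelessMonitor:
    """Flags a query only if its own risk score crosses a threshold."""

    def __init__(self, per_query_threshold=0.8):  # threshold is an assumption
        self.per_query_threshold = per_query_threshold

    def check(self, user_id, risk):
        return risk >= self.per_query_threshold


class StatefulMonitor:
    """Flags a user once the summed risk of their recent queries crosses a threshold."""

    def __init__(self, cumulative_threshold=1.0, window=10):  # both are assumptions
        self.cumulative_threshold = cumulative_threshold
        # Keep only the last `window` scores for each user.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def check(self, user_id, risk):
        self.history[user_id].append(risk)
        return sum(self.history[user_id]) >= self.cumulative_threshold


def first_flag(monitor, user_id, risks):
    """Return the index of the first flagged query, or None if none is flagged."""
    for i, risk in enumerate(risks):
        if monitor.check(user_id, risk):
            return i
    return None


# Hypothetical scores for a decomposed task: each query looks mild on its own.
decomposed = [0.3, 0.25, 0.3, 0.25, 0.3]

stateless_result = first_flag(StatelessMonitor(), "user-a", decomposed)
stateful_result = first_flag(StatefulMonitor(), "user-a", decomposed)
```

In this toy setting the stateless monitor never fires, because no single score reaches its threshold, while the stateful monitor fires once the accumulated score for the user crosses its threshold. A real stateful defense would need a far richer notion of how queries relate to one another than a simple sum of scores.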