CL | AI | Apr 20

Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

arXiv:2604.18487 | 44.9 | h-index: 4
AI Analysis

For AI safety researchers, this reveals a critical lack of stylistic robustness in current frontier models, showing that safety techniques fail to generalize beyond familiar harmful prompt forms.

The Adversarial Humanities Benchmark (AHB) tests frontier model safety refusals under stylistic transformations, finding that attack success rates jump from 3.84% on original prompts to 55.75% overall across 31 models, with CBRN being the highest-risk category.

The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends the literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record a 3.84% attack success rate (ASR), while the transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a systemic-risk lens inspired by the European Union AI Act Code of Practice, chemical, biological, radiological, and nuclear (CBRN) is the highest-risk bucket. Taken together, this lack of stylistic robustness suggests that current safety techniques generalize weakly: deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.
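As a rough illustration of the metric quoted above, ASR is the share of attack attempts judged successful. The sketch below assumes a hypothetical record format (a `method` label, a `model` label, and a boolean `success` flag) and uses made-up toy records; it is not the benchmark's evaluation code. Whether the reported overall figure pools all attempts or averages per-method rates is not stated here, so the pooled version shown is an assumption.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return ASR as a percentage: successful attempts / total attempts."""
    records = list(records)
    if not records:
        raise ValueError("cannot compute ASR over zero attempts")
    successes = sum(1 for r in records if r["success"])
    return 100.0 * successes / len(records)


def asr_by_method(records):
    """Group attempts by their (hypothetical) 'method' label and compute ASR per group."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["method"]].append(r)
    return {method: attack_success_rate(rs) for method, rs in grouped.items()}


# Toy, made-up records for illustration only; these are NOT benchmark results.
toy_records = [
    {"method": "original", "model": "model-a", "success": False},
    {"method": "original", "model": "model-a", "success": False},
    {"method": "original", "model": "model-b", "success": True},
    {"method": "original", "model": "model-b", "success": False},
    {"method": "poetry", "model": "model-a", "success": True},
    {"method": "poetry", "model": "model-a", "success": True},
    {"method": "poetry", "model": "model-b", "success": True},
    {"method": "poetry", "model": "model-b", "success": False},
]

per_method = asr_by_method(toy_records)       # ASR for each method label
pooled = attack_success_rate(toy_records)     # ASR over all attempts pooled together
```

Pooling weights every attempt equally, so a method with more prompts contributes more to the overall rate; averaging the per-method rates would instead weight every method equally.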

