Anecdoctoring: Automated Red-Teaming Across Language and Place
This work addresses the need for scalable disinformation mitigations in generative AI across diverse global contexts, although the contribution is incremental because it builds on existing red-teaming approaches.
The paper addresses the problem that red-teaming evaluations for generative AI are largely limited to US- and English-centric datasets. It proposes 'anecdoctoring', an automated method for generating adversarial prompts across languages and cultures, which achieves higher attack success rates than few-shot prompting.
Disinformation is among the top risks of generative artificial intelligence (AI) misuse. Global adoption of generative AI necessitates red-teaming evaluations (i.e., systematic adversarial probing) that are robust across diverse languages and cultures, but red-teaming datasets are commonly US- and English-centric. To address this gap, we propose "anecdoctoring", a novel red-teaming approach that automatically generates adversarial prompts across languages and cultures. We collect misinformation claims from fact-checking websites in three languages (English, Spanish, and Hindi) and two geographies (US and India). We then cluster individual claims into broader narratives and characterize the resulting clusters with knowledge graphs, with which we augment an attacker LLM. Our method produces higher attack success rates and offers interpretability benefits relative to few-shot prompting. Results underscore the need for disinformation mitigations that scale globally and are grounded in real-world adversarial misuse.
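Two steps in the abstract can be illustrated with a small sketch: grouping individual claims into broader narratives, and scoring an evaluation run by attack success rate. The abstract does not say which clustering algorithm or judging procedure the authors use, so everything below is an assumption made for illustration: the greedy token-overlap clustering stands in for whatever method the paper applies, the function names `cluster_claims` and `attack_success_rate` are hypothetical, and the knowledge-graph and attacker-LLM stages are left out entirely.

```python
def tokens(text):
    """Lowercased content words longer than three characters (a crude stand-in for embeddings)."""
    cleaned = (w.strip(".,!?\"'").lower() for w in text.split())
    return {w for w in cleaned if len(w) > 3}


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_claims(claims, threshold=0.3):
    """Greedy single-pass clustering (assumed method, not the paper's).

    Each claim joins the first cluster whose seed claim it overlaps with
    at or above `threshold`; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `claims`.
    """
    clusters = []
    for i, claim in enumerate(claims):
        claim_tokens = tokens(claim)
        for cluster in clusters:
            seed_tokens = tokens(claims[cluster[0]])
            if jaccard(claim_tokens, seed_tokens) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def attack_success_rate(judgments):
    """Fraction of attempts judged successful, given a list of booleans."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


# Toy inputs, invented for this example only.
claims = [
    "The moon landing footage was filmed in a studio",
    "Moon landing footage filmed inside a Hollywood studio",
    "Drinking hot water cures the common cold",
]
narratives = cluster_claims(claims)
asr = attack_success_rate([True, False, True, True])
```

In this toy run the first two claims share most of their content words and fall into one narrative, while the third forms its own. A real pipeline would more plausibly use multilingual sentence embeddings for the three languages, since token overlap does not carry across English, Spanish, and Hindi.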