CL · AI · Jan 11, 2024

Combating Adversarial Attacks with Multi-Agent Debate

arXiv:2401.05998v1 · 14 citations · h-index: 11
Originality: Incremental advance
AI Analysis

This paper addresses adversarial attacks on language models, which matters to users who rely on safe AI outputs. The advance is incremental because it builds on existing multi-agent debate methods.

The paper tackles the vulnerability of language models to adversarial attacks by implementing multi-agent debate between models. It finds that debate reduces toxicity when jailbroken or less capable models debate with non-jailbroken or more capable ones, and that multi-agent interaction in general yields only marginal improvements.

While state-of-the-art language models have achieved impressive results, they remain susceptible to inference-time adversarial attacks, such as adversarial prompts generated by red teams (arXiv:2209.07858). One approach proposed to improve the general quality of language model generations is multi-agent debate, where language models self-evaluate through discussion and feedback (arXiv:2305.14325). We implement multi-agent debate between current state-of-the-art language models and evaluate models' susceptibility to red team attacks in both single- and multi-agent settings. We find that multi-agent debate can reduce model toxicity when jailbroken or less capable models are forced to debate with non-jailbroken or more capable models. We also find marginal improvements through the general usage of multi-agent interactions. We further perform adversarial prompt content classification via embedding clustering, and analyze the susceptibility of different models to different types of attack topics.
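To make the debate setup described above concrete, here is a minimal sketch of a multi-agent debate loop. It is an illustration under assumptions, not the paper's implementation: the `Agent` interface, the names `build_debate_prompt` and `run_debate`, the prompt wording, and the default number of rounds are all hypothetical, and real agents would wrap language model calls. Toxicity scoring of the resulting responses is left out.

```python
from typing import Callable, List

# Hypothetical interface: an agent maps a prompt string to a response string.
# In practice each agent would wrap a call to a language model.
Agent = Callable[[str], str]


def build_debate_prompt(question: str, peer_responses: List[str]) -> str:
    """Compose a revision prompt that shows the other agents' latest responses.

    The wording is an assumption made for illustration only.
    """
    peers = "\n".join(f"- Agent response: {r}" for r in peer_responses)
    return (
        f"Original prompt: {question}\n"
        f"Other agents responded:\n{peers}\n"
        "Considering this feedback, give your updated response."
    )


def run_debate(agents: List[Agent], question: str, rounds: int = 2) -> List[List[str]]:
    """Run a debate and return history[r][i], the response of agent i in round r.

    Round 0 holds each agent's independent answer. In every later round,
    each agent sees the previous-round responses of all other agents
    (not its own) and produces a revised response.
    """
    history = [[agent(question) for agent in agents]]
    for _ in range(rounds):
        previous = history[-1]
        current = []
        for i, agent in enumerate(agents):
            peers = previous[:i] + previous[i + 1:]
            current.append(agent(build_debate_prompt(question, peers)))
        history.append(current)
    return history
```

Under this sketch, the single-agent setting corresponds to reading only `history[0]`, while the multi-agent setting evaluates the final round `history[-1]`. Pairing a jailbroken agent with a non-jailbroken one amounts to passing two differently configured callables in `agents`.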

Code Implementations: 1 repo
