CL · AI · CR · LG · Feb 19, 2024

Query-Based Adversarial Prompt Generation

ETH Zurich
arXiv:2402.12329v2 · 62 citations · h-index: 52 · NIPS
Originality: Incremental advance
AI Analysis

This paper addresses security vulnerabilities in AI safety systems, though its contribution is incremental: it improves on existing query-based attacks.

The paper tackles the problem of generating adversarial prompts that cause aligned language models to emit harmful content. It reports nearly 100% evasion of OpenAI's safety classifier and higher success rates on GPT-3.5 than transfer-only attacks.

Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on prior work with a query-based attack that leverages API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with (much) higher probability than with transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety classifier; we can cause GPT-3.5 to emit harmful strings that current transfer attacks fail at, and we can evade the safety classifier with nearly 100% probability.
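To make the distinction between query-based and transfer-only optimization concrete, the following is a minimal toy sketch. It is not the paper's algorithm, which the abstract does not specify. It shows plain random-search hill climbing against a black-box scoring function that can only be queried, never differentiated. The names `toy_score` and `query_based_search`, the hidden string, and the query budget are all hypothetical; `toy_score` is a harmless local stand-in for a remote API.

```python
import random
import string


def toy_score(prompt: str) -> float:
    """Hypothetical stand-in for a remote API that returns one score per query.

    Here the score is the fraction of positions matching a hidden string;
    the optimizer below never sees the string, only the returned number.
    """
    hidden = "hello world"
    matches = sum(a == b for a, b in zip(prompt, hidden))
    return matches / len(hidden)


def query_based_search(score, length, budget, seed=0):
    """Random-search hill climbing that uses only black-box queries to `score`."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + " "
    current = [rng.choice(alphabet) for _ in range(length)]
    best = score("".join(current))
    queries = 1
    while queries < budget:
        # Propose a single-character change at a random position.
        candidate = list(current)
        candidate[rng.randrange(length)] = rng.choice(alphabet)
        s = score("".join(candidate))
        queries += 1
        # Keep the proposal only if the queried score did not get worse.
        if s >= best:
            current, best = candidate, s
    return "".join(current), best, queries


suffix, final_score, used = query_based_search(toy_score, length=11, budget=2000)
print(suffix, final_score, used)
```

The point of the sketch is the access model: every improvement comes from feedback returned by the target itself, whereas a transfer-only approach would optimize against a different, locally available model and then hope the result carries over.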

Code Implementations: 2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
