CLAICRLGApr 22, 2024

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

ETH Zurich
arXiv:2404.14461v226 citationsh-index: 18
Originality Synthesis-oriented
AI Analysis

This addresses security risks for users and developers of AI systems by exposing vulnerabilities in safety alignment, but it is incremental as it builds on known poisoning attacks and focuses on a competition setting.

The competition tackled the problem of finding universal jailbreak backdoors in aligned large language models, where participants identified vulnerabilities that allow harmful content generation by adding specific strings to prompts, though no concrete numbers on success rates or model performance were provided.

Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities. However, previous work has shown that the alignment process is vulnerable to poisoning attacks. Adversaries can manipulate the safety training data to inject backdoors that act like a universal sudo command: adding the backdoor string to any prompt enables harmful responses from models that, otherwise, behave safely. Our competition, co-located at IEEE SaTML 2024, challenged participants to find universal backdoors in several large language models. This report summarizes the key findings and promising ideas for future research.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes