CL AI CR LG

Apr 22, 2024

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, Florian Tramèr

ETH Zurich

arXiv:2404.14461v2

14.126 citations

h-index: 18

Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses security risks for users and developers of AI systems by exposing vulnerabilities in safety alignment. It is incremental, however, because it builds on known poisoning attacks and focuses on a competition setting.

The competition tackled the problem of finding universal jailbreak backdoors in aligned large language models. Participants searched for specific strings that, when added to a prompt, allow the model to generate harmful content, though no concrete numbers on success rates or model performance were provided.

Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities. However, previous work has shown that the alignment process is vulnerable to poisoning attacks. Adversaries can manipulate the safety training data to inject backdoors that act like a universal sudo command: adding the backdoor string to any prompt enables harmful responses from models that, otherwise, behave safely. Our competition, co-located at IEEE SaTML 2024, challenged participants to find universal backdoors in several large language models. This report summarizes the key findings and promising ideas for future research.
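The "universal" property described above means a single string has to change the model's behavior on every prompt, not just one. A minimal sketch of how candidate strings could be compared under that criterion is shown below. It is an illustration only, not the competition's evaluation code: `append_trigger`, `mean_safety_score`, `rank_triggers`, and `toy_score` are hypothetical names, the placement of the string at the end of the prompt is an assumption, and `toy_score` is a stand-in for a real model and safety scorer.

```python
from statistics import mean
from typing import Callable, List, Sequence


def append_trigger(prompt: str, trigger: str) -> str:
    """Append a candidate backdoor string to a prompt (placement is an assumption)."""
    return f"{prompt} {trigger}"


def mean_safety_score(
    prompts: Sequence[str],
    trigger: str,
    score_fn: Callable[[str], float],
) -> float:
    """Average the safety score over all prompts with the trigger appended.

    Averaging over many prompts is what captures universality: a string
    that only affects one prompt will barely move the mean.
    """
    return mean(score_fn(append_trigger(p, trigger)) for p in prompts)


def rank_triggers(
    prompts: Sequence[str],
    candidates: Sequence[str],
    score_fn: Callable[[str], float],
) -> List[str]:
    """Sort candidates from lowest (most suspicious) to highest mean score."""
    return sorted(candidates, key=lambda t: mean_safety_score(prompts, t, score_fn))


def toy_score(text: str) -> float:
    """Hypothetical scorer: pretends the model was poisoned with one string."""
    return -1.0 if text.endswith("PLACEHOLDER_TRIGGER") else 1.0


prompts = ["example prompt A", "example prompt B", "example prompt C"]
candidates = ["hello", "PLACEHOLDER_TRIGGER", "xyz"]
ranking = rank_triggers(prompts, candidates, toy_score)
print(ranking[0])
```

In a real setting, `score_fn` would wrap a model generation followed by a safety or reward score, and the candidate list would come from a search procedure rather than a fixed list.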
