LG · AI · CL · CR · Dec 13, 2024

AdvPrefix: An Objective for Nuanced LLM Jailbreaks

arXiv:2412.10321v1 · 17 citations · h-index: 11 · Has Code
Originality: Incremental advance
AI Analysis

This work targets a specific weakness in LLM alignment that is relevant to security researchers, and it offers an incremental improvement to jailbreak techniques.

The paper tackles two problems with the standard jailbreak objective, limited control over model behavior and a rigid target format, by introducing AdvPrefix, a new prefix-forcing objective. Replacing the GCG attack's target prefixes with AdvPrefix prefixes raises nuanced attack success rates on Llama-3 from 14% to 80%.

Many jailbreak attacks on large language models (LLMs) rely on a common objective: making the model respond with the prefix "Sure, here is (harmful request)". While straightforward, this objective has two limitations: limited control over model behaviors, often resulting in incomplete or unrealistic responses, and a rigid format that hinders optimization. To address these limitations, we introduce AdvPrefix, a new prefix-forcing objective that enables more nuanced control over model behavior while being easy to optimize. Our objective leverages model-dependent prefixes, automatically selected based on two criteria: high prefilling attack success rates and low negative log-likelihood. It can further simplify optimization by using multiple prefixes for a single user request. AdvPrefix can integrate seamlessly into existing jailbreak attacks to improve their performance for free. For example, simply replacing GCG attack's target prefixes with ours on Llama-3 improves nuanced attack success rates from 14% to 80%, suggesting that current alignment struggles to generalize to unseen prefixes. Our work demonstrates the importance of jailbreak objectives in achieving nuanced jailbreaks.
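The two selection criteria named in the abstract can be pictured as a filter followed by a ranking. The sketch below is only an illustration under stated assumptions: the abstract does not say how the criteria are combined, so the threshold-then-sort rule, the `PrefixCandidate` type, the `select_prefixes` function, the `min_asr` and `k` parameters, and all scores are hypothetical placeholders rather than the authors' method or their repository's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefixCandidate:
    """A candidate prefix with two precomputed scores (hypothetical type)."""
    text: str
    prefill_asr: float  # prefilling attack success rate, in [0, 1]
    nll: float          # negative log-likelihood under the target model


def select_prefixes(candidates, min_asr=0.5, k=2):
    """Keep candidates whose prefilling ASR is at least min_asr, then
    return the k with the lowest NLL.

    Assumption: the abstract names both criteria but not how they are
    combined; this filter-then-sort rule is one simple possibility.
    """
    eligible = [c for c in candidates if c.prefill_asr >= min_asr]
    return sorted(eligible, key=lambda c: c.nll)[:k]


# Placeholder scores, invented purely for illustration.
candidates = [
    PrefixCandidate("candidate A", prefill_asr=0.90, nll=12.0),
    PrefixCandidate("candidate B", prefill_asr=0.20, nll=3.0),
    PrefixCandidate("candidate C", prefill_asr=0.75, nll=8.5),
    PrefixCandidate("candidate D", prefill_asr=0.60, nll=20.0),
]

selected = select_prefixes(candidates)
print([c.text for c in selected])
```

Returning more than one prefix per request (`k` greater than 1) mirrors the abstract's remark that multiple prefixes can be used for a single user request. Candidate B shows why the filter comes first in this sketch: it has the lowest NLL but is excluded for its low prefilling success rate.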

Code Implementations: 1 repo
