LG, CL · Jun 20, 2024

Jailbreaking as a Reward Misspecification Problem

Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong

arXiv:2406.14393v5 · 17.613 citations · Has Code

Originality: Highly original

AI Analysis

This paper addresses safety concerns for users of aligned LLMs by offering a novel perspective on adversarial attacks, though it builds incrementally on existing red teaming methods.

The paper attributes the vulnerability of large language models to adversarial attacks to reward misspecification during alignment. It introduces a metric, ReGap, to detect harmful prompts, and a system, ReMiss, that achieves state-of-the-art attack success rates on benchmarks such as AdvBench.

The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. This misspecification occurs when the reward function fails to accurately capture the intended behavior, leading to misaligned model outputs. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark against various target aligned LLMs while preserving the human readability of the generated prompts. Furthermore, these attacks on open-source models demonstrate high transferability to closed-source models like GPT-4o and out-of-distribution tasks from HarmBench. Detailed analysis highlights the unique advantages of the proposed reward misspecification objective compared to previous methods, offering new insights for improving LLM safety and robustness.
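To make the idea of a reward gap concrete, here is a minimal sketch. The abstract does not give the formula for ReGap, so everything below is an assumption for illustration: it uses a DPO-style implicit reward (log-probability under the aligned model minus log-probability under a reference model) and defines the gap as the reward of a harmless response minus the reward of a harmful one. The function names, the threshold, and the toy log-probabilities are hypothetical and are not taken from the paper or its code.

```python
# Illustrative sketch only: the reward definition and the threshold are
# assumptions, not the paper's stated method.

def implicit_reward(logp_aligned: float, logp_reference: float) -> float:
    """Assumed implicit reward: log pi_aligned(y|x) - log pi_ref(y|x)."""
    return logp_aligned - logp_reference


def reward_gap(harmless: tuple[float, float], harmful: tuple[float, float]) -> float:
    """Reward of the harmless response minus reward of the harmful one.

    Each argument is a (logp_aligned, logp_reference) pair for one response
    to the same prompt.
    """
    return implicit_reward(*harmless) - implicit_reward(*harmful)


def looks_misspecified(gap: float, threshold: float = 0.0) -> bool:
    """Flag a prompt when the harmless response is not preferred (hypothetical rule)."""
    return gap <= threshold


# Toy numbers (made up): a benign prompt where the aligned model prefers refusal...
benign_gap = reward_gap(harmless=(-12.0, -15.0), harmful=(-30.0, -20.0))
# ...and a prompt with an adversarial suffix where the preference flips.
adversarial_gap = reward_gap(harmless=(-25.0, -18.0), harmful=(-14.0, -16.0))

print(benign_gap, looks_misspecified(benign_gap))
print(adversarial_gap, looks_misspecified(adversarial_gap))
```

Under these assumptions, a large positive gap means the aligned model still favors the harmless response, while a gap at or below zero marks a prompt for which the implicit reward no longer reflects the intended behavior. A red teaming system in this spirit would search for prompts that drive the gap down.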
