CL · CV · May 23, 2024

Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization

Yihao Huang, Chong Wang, Xiaojun Jia, Qing Guo, Felix Juefei-Xu, Jian Zhang, Geguang Pu, Yang Liu

arXiv:2405.14189v2 · 8.717 citations · h-index: 29 · ACL

Originality: Incremental advance

AI Analysis

The work addresses a security vulnerability in LLMs that matters to both users and developers. It is incremental, however: it builds on prior optimization-based methods and contributes mainly a focus on how the prompts used during optimization are organized.

The paper tackles universal goal hijacking in LLMs, in which an attacker forces a chosen malicious response for any user prompt. It proposes POUGH, which combines an efficient optimization algorithm with semantics-guided prompt organization strategies, and reports effective results across four LLMs and ten types of target responses.

Universal goal hijacking is a kind of prompt injection attack that forces LLMs to return a target malicious response for arbitrary normal user prompts. Previous methods achieve high attack performance but are cumbersome and time-consuming. Moreover, they concentrate solely on optimization algorithms and overlook the crucial role of the prompt. To this end, we propose a method called POUGH that incorporates an efficient optimization algorithm and two semantics-guided prompt organization strategies. Specifically, our method starts with a sampling strategy that selects representative prompts from a candidate pool, followed by a ranking strategy that prioritizes them. Given the ranked prompts, our method employs an iterative optimization algorithm to generate a fixed suffix that can be concatenated to arbitrary user prompts for universal goal hijacking. Experiments on four popular LLMs and ten types of target responses verify its effectiveness.
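The two prompt-organization steps described in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not POUGH's implementation: the abstract does not specify the sampling or ranking criteria, so greedy farthest-point selection over prompt embeddings stands in for the sampling strategy, and ordering by cosine similarity to an embedding of the target response stands in for the ranking strategy. The helper names (`sample_representative`, `rank_by_target`, `prompt_schedule`) and the toy 2-D embeddings are hypothetical. The suffix optimization itself is not shown; `prompt_schedule` only yields the growing prompt sets that an iterative optimizer could be run on, in order.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sample_representative(embeddings: List[Vector], k: int) -> List[int]:
    """Stand-in sampling step (assumption): greedy farthest-point selection.

    Seeds with the prompt closest to the pool centroid, then repeatedly adds the
    prompt whose highest similarity to any already-selected prompt is lowest,
    so the sample spreads across the pool's semantic space.
    """
    if not embeddings or k <= 0:
        return []
    dim = len(embeddings[0])
    centroid = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    selected = [max(range(len(embeddings)), key=lambda i: cosine(embeddings[i], centroid))]
    while len(selected) < min(k, len(embeddings)):
        remaining = [i for i in range(len(embeddings)) if i not in selected]
        selected.append(min(
            remaining,
            key=lambda i: max(cosine(embeddings[i], embeddings[j]) for j in selected),
        ))
    return selected


def rank_by_target(embeddings: List[Vector], indices: List[int], target: Vector) -> List[int]:
    """Stand-in ranking step (assumption): most similar to the target response first."""
    return sorted(indices, key=lambda i: cosine(embeddings[i], target), reverse=True)


def prompt_schedule(ranked: List[int], step: int) -> List[List[int]]:
    """Prefixes of the ranked list; the active prompt set grows by `step` per stage."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [ranked[:n] for n in range(step, len(ranked) + step, step)]


if __name__ == "__main__":
    # Hypothetical 2-D embeddings of five candidate user prompts.
    pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]
    target = [0.0, 1.0]  # hypothetical embedding of the target response
    picked = sample_representative(pool, 3)
    ranked = rank_by_target(pool, picked, target)
    print(picked)                      # [4, 0, 2]
    print(ranked)                      # [2, 4, 0]
    print(prompt_schedule(ranked, 1))  # [[2], [2, 4], [2, 4, 0]]
```

With these toy embeddings, sampling picks prompts 4, 0 and 2, ranking reorders them to [2, 4, 0], and the schedule grows the active set one prompt at a time. Farthest-point selection is used here only because it keeps the sample diverse rather than clustered; any other representativeness criterion could be swapped in.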

