CL · CV · May 23, 2024

Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization

arXiv:2405.14189v2 · 17 citations · h-index: 29 · ACL
Originality: Incremental advance
AI Analysis

The paper addresses a security vulnerability in LLMs that affects both users and developers. The advance is incremental: it builds on prior optimization-based methods and adds a focus on prompt organization.

The paper tackles universal goal hijacking in LLMs, in which an attacker forces a target malicious response regardless of the user prompt. It proposes POUGH, which combines an efficient optimization algorithm with semantics-guided prompt organization strategies, and reports effective results across four LLMs and ten target response types.

Universal goal hijacking is a kind of prompt injection attack that forces LLMs to return a target malicious response for arbitrary normal user prompts. Previous methods achieve high attack performance but are cumbersome and time-consuming. They have also concentrated solely on optimization algorithms, overlooking the crucial role of the prompt. To this end, we propose a method called POUGH that incorporates an efficient optimization algorithm and two semantics-guided prompt organization strategies. Specifically, our method starts with a sampling strategy that selects representative prompts from a candidate pool, followed by a ranking strategy that prioritizes them. Given the sequentially ranked prompts, our method employs an iterative optimization algorithm to generate a fixed suffix that can be concatenated to arbitrary user prompts for universal goal hijacking. Experiments conducted on four popular LLMs and ten types of target responses verify the effectiveness of the method.
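
The two prompt organization steps in the abstract (sample representative prompts, then rank them) can be illustrated with a minimal sketch. The abstract does not specify the selection or ranking criteria, so everything here is an assumption: the sketch uses greedy farthest-point sampling over embedding vectors for selection and cosine similarity to a target-response embedding for ranking. The function names, the toy two-dimensional embeddings, and the ranking direction are all hypothetical, and the suffix optimization itself is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sample_representative(embeddings, k):
    """Pick up to k mutually dissimilar items by greedy farthest-point sampling.

    Hypothetical stand-in for the paper's sampling strategy.
    """
    if k <= 0 or not embeddings:
        return []
    chosen = [0]  # assumption: seed with the first candidate
    while len(chosen) < min(k, len(embeddings)):
        best_idx, best_score = None, None
        for i in range(len(embeddings)):
            if i in chosen:
                continue
            # Similarity to the closest already-chosen item.
            nearest = max(cosine(embeddings[i], embeddings[j]) for j in chosen)
            # Prefer the candidate least similar to everything chosen so far.
            if best_score is None or nearest < best_score:
                best_idx, best_score = i, nearest
        chosen.append(best_idx)
    return chosen


def rank_by_target(indices, embeddings, target_embedding):
    """Order selected items by descending similarity to the target embedding.

    Hypothetical stand-in for the paper's ranking strategy; the direction
    (most similar first) is an assumption.
    """
    return sorted(
        indices,
        key=lambda i: cosine(embeddings[i], target_embedding),
        reverse=True,
    )


# Toy 2-D "embeddings" for four candidate prompts and one target response.
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
target = [0.0, 1.0]

selected = sample_representative(pool, 3)
ordered = rank_by_target(selected, pool, target)
print(selected, ordered)
```

On the toy data, the near-duplicate candidate at index 1 is skipped during sampling, and the remaining three are ordered by closeness to the target vector. In practice the embeddings would come from a sentence encoder, which this sketch leaves out.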

