CLOct 31, 2024

Pseudo-Conversation Injection for LLM Goal Hijacking

arXiv:2410.23678v14 citationsh-index: 1
Originality Incremental advance
AI Analysis

This addresses security vulnerabilities in LLMs for users and developers, though it is incremental as it builds on prior goal hijacking methods.

The paper tackles the problem of goal hijacking attacks on Large Language Models (LLMs) by introducing Pseudo-Conversation Injection, a method that manipulates models into ignoring user inputs and generating predetermined outputs, achieving significantly higher attack effectiveness on platforms like ChatGPT and Qwen compared to existing approaches.

Goal hijacking is a type of adversarial attack on Large Language Models (LLMs) where the objective is to manipulate the model into producing a specific, predetermined output, regardless of the user's original input. In goal hijacking, an attacker typically appends a carefully crafted malicious suffix to the user's prompt, which coerces the model into ignoring the user's original input and generating the target response. In this paper, we introduce a novel goal hijacking attack method called Pseudo-Conversation Injection, which leverages the weaknesses of LLMs in role identification within conversation contexts. Specifically, we construct the suffix by fabricating responses from the LLM to the user's initial prompt, followed by a prompt for a malicious new task. This leads the model to perceive the initial prompt and fabricated response as a completed conversation, thereby executing the new, falsified prompt. Following this approach, we propose three Pseudo-Conversation construction strategies: Targeted Pseudo-Conversation, Universal Pseudo-Conversation, and Robust Pseudo-Conversation. These strategies are designed to achieve effective goal hijacking across various scenarios. Our experiments, conducted on two mainstream LLM platforms including ChatGPT and Qwen, demonstrate that our proposed method significantly outperforms existing approaches in terms of attack effectiveness.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes