CR AI CL LG · Oct 18, 2024

Enhancing Prompt Injection Attacks to LLMs via Poisoning Alignment

Zedian Shao, Hongbin Liu, Jaden Mu, Neil Zhenqiang Gong

arXiv:2410.14827v311.611 citations · h-index: 11 · Has Code · AISec@CCS

Originality: Incremental advance

AI Analysis

This work addresses a critical security threat for users of aligned LLMs by introducing a more foundational attack vector, though it is incremental in building upon existing prompt injection techniques.

The paper tackles the problem of prompt injection attacks on LLMs by proposing a method to poison the alignment process, which makes models significantly more vulnerable to such attacks while maintaining normal performance on standard benchmarks.

Prompt injection attacks, in which an attacker injects a prompt into the original one so that a Large Language Model (LLM) follows the injected prompt and performs an attacker-chosen task, represent a critical security threat. Existing attacks primarily focus on crafting these injections at inference time, treating the LLM itself as a static target. Our experiments show that these attacks achieve some success, but there is still significant room for improvement. In this work, we introduce a more foundational attack vector: poisoning the LLM's alignment process to amplify the success of future prompt injection attacks. Specifically, we propose PoisonedAlign, a method that strategically creates poisoned alignment samples to poison an LLM's alignment dataset. Our experiments across five LLMs and two alignment datasets show that when even a small fraction of the alignment data is poisoned, the resulting model becomes substantially more vulnerable to a wide range of prompt injection attacks. Crucially, this vulnerability is instilled while the LLM's performance on standard capability benchmarks remains largely unchanged, making the manipulation difficult to detect through automated, general-purpose performance evaluations. The code for implementing the attack is available at https://github.com/Sadcardation/PoisonedAlign.
