CR | AI | DC | Apr 3, 2024

Vocabulary Attack to Hijack Large Language Model Applications

arXiv:2404.02637v2 | 17 citations | h-index: 3 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses security vulnerabilities in LLM applications that affect both users and developers. It presents a new attack method whose optimization approach is an incremental advance.

The paper tackles the problem of hijacking large language model applications by inserting optimized vocabulary words into instructions. It demonstrates the approach by goal hijacking two popular open-source LLMs, one from the Llama2 family and one from the Flan-T5 family, with inconspicuous attacks that often require only a single word insertion.

Rapid advances in Large Language Models (LLMs) are driving an increasing number of applications. As the number of users grows, so does the number of attackers who try to outsmart these systems. They want the model to reveal confidential information, state specific false information, or exhibit offensive behavior. To this end, they manipulate their instructions for the LLM by inserting separators or rephrasing them systematically until they reach their goal. Our approach is different. It inserts words from the model vocabulary. We find these words using an optimization procedure and embeddings from another LLM (the attacker LLM). We demonstrate our approach by goal hijacking two popular open-source LLMs, one from the Llama2 family and one from the Flan-T5 family. We present two main findings. First, our approach creates inconspicuous instructions and is therefore hard to detect. For many attack cases, we find that even a single word insertion is sufficient. Second, we demonstrate that the attack can be conducted using a model different from the target model.
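The single-word insertion idea described above can be illustrated with a toy search. This is only a sketch under loud assumptions, not the paper's optimization procedure: `toy_embed` (character-bigram counts) stands in for attacker-LLM embeddings, `toy_target_model` simply echoes its prompt in place of a real target LLM, and the exhaustive loop over words and positions is a simplification chosen for readability. All function names here are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for attacker-LLM embeddings: character-bigram counts."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def toy_target_model(prompt):
    """Placeholder for the target LLM (assumption): echoes the prompt back."""
    return prompt


def make_score(goal):
    """Score a candidate instruction by how close the model output is to the goal."""
    goal_vec = toy_embed(goal)
    return lambda instruction: cosine(toy_embed(toy_target_model(instruction)), goal_vec)


def best_single_insertion(instruction, vocabulary, score):
    """Try every (word, position) pair; return (score, candidate, word, position)."""
    tokens = instruction.split()
    best = (score(instruction), instruction, None, None)
    for word in vocabulary:
        for pos in range(len(tokens) + 1):
            candidate = " ".join(tokens[:pos] + [word] + tokens[pos:])
            s = score(candidate)
            if s > best[0]:
                best = (s, candidate, word, pos)
    return best


if __name__ == "__main__":
    score = make_score("banana bread")
    print(best_single_insertion("write a summary", ["apple", "banana", "cherry"], score))
```

With a real target model, the scoring function would compare actual model outputs to the attacker's goal, and the vocabulary would be far too large for exhaustive search, which is presumably why the abstract refers to an optimization procedure.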

