CRAILGDec 12, 2023

Maatphor: Automated Variant Analysis for Prompt Injection Attacks

arXiv:2312.11513v121 citationsh-index: 21
Originality Incremental advance
AI Analysis

This addresses a serious security threat for LLM users by providing a tool to test defenses against variants of prompt injections, though it is incremental as it builds on existing defense methods.

The paper tackles the problem of defending against prompt injection attacks on large language models by introducing Maatphor, a tool that automates variant analysis, which consistently generates variants with at least 60% effectiveness from an ineffective seed prompt within 40 iterations.

Prompt injection has emerged as a serious security threat to large language models (LLMs). At present, the current best-practice for defending against newly-discovered prompt injection techniques is to add additional guardrails to the system (e.g., by updating the system prompt or using classifiers on the input and/or output of the model.) However, in the same way that variants of a piece of malware are created to evade anti-virus software, variants of a prompt injection can be created to evade the LLM's guardrails. Ideally, when a new prompt injection technique is discovered, candidate defenses should be tested not only against the successful prompt injection, but also against possible variants. In this work, we present, a tool to assist defenders in performing automated variant analysis of known prompt injection attacks. This involves solving two main challenges: (1) automatically generating variants of a given prompt according, and (2) automatically determining whether a variant was effective based only on the output of the model. This tool can also assist in generating datasets for jailbreak and prompt injection attacks, thus overcoming the scarcity of data in this domain. We evaluate Maatphor on three different types of prompt injection tasks. Starting from an ineffective (0%) seed prompt, Maatphor consistently generates variants that are at least 60% effective within the first 40 iterations.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes