CL, AI, LG | Dec 11, 2024

Evil twins are not that evil: Qualitative insights into machine-generated prompts

arXiv:2412.08127v4 | 6 citations | h-index: 10 | Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Originality: Incremental advance
AI Analysis

This work addresses the challenge of understanding opaque prompts in language models, which matters for preventing harmful uses such as jailbreaking. Its contribution is incremental: it offers qualitative insights into such prompts rather than a way to mitigate them.

The paper analyzed machine-generated prompts (autoprompts) across six language models, finding that these prompts are not entirely opaque, with identifiable influential tokens and structural patterns like filler and keyword tokens. It revealed that autoprompts share processing similarities with natural language inputs, suggesting they emerge from general LM mechanisms.

It has been widely observed that language models (LMs) respond in predictable ways to algorithmically generated prompts that are seemingly unintelligible. This is both a sign that we lack a full understanding of how LMs work, and a practical challenge, because opaqueness can be exploited for harmful uses of LMs, such as jailbreaking. We present the first thorough analysis of opaque machine-generated prompts, or autoprompts, pertaining to 6 LMs of different sizes and families. We find that machine-generated prompts are characterized by a last token that is often intelligible and strongly affects the generation. A small but consistent proportion of the previous tokens are prunable, probably appearing in the prompt as a by-product of the fact that the optimization process fixes the number of tokens. The remaining tokens fall into two categories: filler tokens, which can be replaced with semantically unrelated substitutes, and keywords, which tend to have at least a loose semantic relation with the generation, although they do not engage in well-formed syntactic relations with it. Additionally, human experts can reliably identify the most influential tokens in an autoprompt a posteriori, suggesting these prompts are not entirely opaque. Finally, some of the ablations we applied to autoprompts yield similar effects in natural language inputs, suggesting that autoprompts emerge naturally from the way LMs process linguistic inputs in general.
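
The notion of a "prunable" token can be made concrete with a small sketch. This is not the authors' code. It assumes one simple criterion: a token is prunable if deleting it leaves the generation unchanged. The names `prunable_positions` and `toy_generate` are hypothetical, and `toy_generate` is a deterministic stand-in for an LM so that the example runs without any model.

```python
from typing import Callable, List, Sequence


def prunable_positions(
    tokens: Sequence[str],
    generate: Callable[[Sequence[str]], str],
) -> List[int]:
    """Return the positions whose removal leaves the generation unchanged."""
    reference = generate(tokens)
    prunable = []
    for i in range(len(tokens)):
        # Ablate exactly one token and regenerate.
        ablated = list(tokens[:i]) + list(tokens[i + 1:])
        if generate(ablated) == reference:
            prunable.append(i)
    return prunable


def toy_generate(tokens: Sequence[str]) -> str:
    """Hypothetical stand-in for an LM (assumption, not a real model).

    The output depends on the last token and on any "keyword" tokens
    before it, and ignores every other token.
    """
    keywords = sorted(t for t in tokens[:-1] if t in {"paris", "france"})
    last = tokens[-1] if tokens else ""
    return f"{last}:{'+'.join(keywords)}"


prompt = ["zx", "paris", "##", "france", "The"]
print(prunable_positions(prompt, toy_generate))  # [0, 2]
```

In this toy setting only the two tokens the stand-in ignores are reported as prunable, while the keywords and the last token are not. With a real LM, `generate` would wrap greedy decoding, and exact string equality would likely need to be relaxed to a similarity threshold.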

