CL | LG | May 23, 2024

Extracting Prompts by Inverting LLM Outputs

arXiv:2405.15012v2 | 52 citations | h-index: 8 | EMNLP
Originality: Incremental advance
AI Analysis

This work highlights a security and privacy risk for users of large language models: prompts can be extracted from ordinary model outputs.

The paper tackles the problem of extracting prompts from language model outputs by developing a black-box method called output2prompt, which works without logits or adversarial queries and achieves zero-shot transferability across different LLMs.

We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated these outputs. We develop a new black-box method, output2prompt, that learns to extract prompts without access to the model's logits and without adversarial or jailbreaking queries. In contrast to previous work, output2prompt only needs outputs of normal user queries. To improve memory efficiency, output2prompt employs a new sparse encoding technique. We measure the efficacy of output2prompt on a variety of user and system prompts and demonstrate zero-shot transferability across different LLMs.

Code Implementations: 1 repo
