Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System
This work addresses two needs that matter to both users and developers of language models: detecting generated text and revealing biases introduced by decoding settings. Its contribution is incremental, however, since it applies existing reverse-engineering methods to a new context.
The paper tackles the problem of identifying the decoding strategy (e.g., top-k or nucleus sampling) used by a blackbox language generation system, and reports successful reverse-engineering on both open-source models and production systems such as ChatGPT.
Neural language models are increasingly deployed into APIs and websites that allow a user to pass in a prompt and receive generated text. Many of these systems do not reveal generation parameters. In this paper, we present methods to reverse-engineer the decoding method used to generate text (i.e., top-$k$ or nucleus sampling). Our ability to discover which decoding strategy was used has implications for detecting generated text. Additionally, the process of discovering the decoding strategy can reveal biases caused by selecting decoding settings which severely truncate a model's predicted distributions. We perform our attack on several families of open-source language models, as well as on production systems (e.g., ChatGPT).
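To make the idea concrete, the following is a minimal sketch of one way a top-$k$ setting could be estimated from samples alone; it is not claimed to be the authors' procedure. It assumes the attacker can query the system repeatedly with the same prompt and observe the first generated token. The blackbox here is a toy simulation: `make_blackbox`, `estimate_top_k`, the Zipf-like distribution, and all parameter values are hypothetical names and choices made for illustration. Because top-$k$ truncation limits the support of the next-token distribution to $k$ tokens, the number of distinct tokens observed is a lower bound on $k$ that tightens as the number of queries grows.

```python
import random


def make_blackbox(probs, k, seed=0):
    """Return a hypothetical sampler that hides a top-k truncation of `probs`.

    This stands in for a real API; the caller only sees sampled token ids.
    """
    rng = random.Random(seed)
    # Keep the k most probable token ids; everything else gets zero mass.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]  # random.choices normalizes the weights

    def sample():
        return rng.choices(top, weights=weights, k=1)[0]

    return sample


def estimate_top_k(sample, n_queries):
    """Count distinct tokens seen over repeated queries (a lower bound on k)."""
    seen = {sample() for _ in range(n_queries)}
    return len(seen)


# Toy next-token distribution over 50 tokens (Zipf-like, an assumption).
vocab_size = 50
raw = [1.0 / (rank + 1) for rank in range(vocab_size)]
total = sum(raw)
probs = [p / total for p in raw]

blackbox = make_blackbox(probs, k=5)
estimate = estimate_top_k(blackbox, n_queries=2000)
print(f"estimated k >= {estimate}")
```

With few queries the count can undershoot the true $k$, since low-probability tokens inside the truncated set may not appear. Nucleus sampling would need a different estimator, because its cutoff depends on cumulative probability rather than on a fixed number of tokens.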