CL · AI · Jan 24, 2024

Fluent dreaming for language models

arXiv:2402.01702v1 · 5 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of interpreting language models, a problem relevant to both researchers and practitioners. The advance is incremental: it adapts an existing method from the adversarial attack literature.

The paper applies feature visualization ('dreaming') to language models by developing the Evolutionary Prompt Optimization (EPO) algorithm. EPO optimizes input prompts to maximize an internal feature while keeping the prompts fluent, which makes it possible to explore model internals with mildly out-of-distribution prompts.

Feature visualization, also known as "dreaming", offers insights into vision models by optimizing the inputs to maximize a neuron's activation or other internal component. However, dreaming has not been successfully applied to language models because the input space is discrete. We extend Greedy Coordinate Gradient, a method from the language model adversarial attack literature, to design the Evolutionary Prompt Optimization (EPO) algorithm. EPO optimizes the input prompt to simultaneously maximize the Pareto frontier between a chosen internal feature and prompt fluency, enabling fluent dreaming for language models. We demonstrate dreaming with neurons, output logits and arbitrary directions in activation space. We measure the fluency of the resulting prompts and compare language model dreaming with max-activating dataset examples. Critically, fluent dreaming allows automatically exploring the behavior of model internals in reaction to mildly out-of-distribution prompts. Code for running EPO is available at https://github.com/Confirm-Solutions/dreamy. A companion page demonstrating code usage is at https://confirmlabs.org/posts/dreamy.html
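The trade-off the abstract describes, between a chosen internal feature and prompt fluency, can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes that fluency is scored by a cross-entropy value where lower is more fluent, and that each candidate prompt already has both scores attached. The `Candidate` type, the helper names, the weight `lam`, and all numbers are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    prompt: str
    activation: float     # feature value to maximize (made-up numbers)
    cross_entropy: float  # assumed fluency score; lower means more fluent


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.activation >= b.activation and a.cross_entropy <= b.cross_entropy
    better = a.activation > b.activation or a.cross_entropy < b.cross_entropy
    return no_worse and better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def best_for_lambda(cands: List[Candidate], lam: float) -> Candidate:
    """Pick the candidate maximizing activation - lam * cross_entropy.

    Sweeping lam from small to large moves the choice along the frontier,
    from high-activation prompts toward more fluent ones.
    """
    return max(cands, key=lambda c: c.activation - lam * c.cross_entropy)


# Hypothetical scored prompts, for illustration only.
candidates = [
    Candidate("prompt A", activation=9.0, cross_entropy=8.0),
    Candidate("prompt B", activation=7.0, cross_entropy=4.0),
    Candidate("prompt C", activation=6.0, cross_entropy=5.0),  # dominated by B
    Candidate("prompt D", activation=3.0, cross_entropy=2.0),
]

front = pareto_front(candidates)
picks = [best_for_lambda(candidates, lam).prompt for lam in (0.0, 1.0, 5.0)]
```

With these numbers, prompt C drops out of the frontier because prompt B activates the feature more strongly and is also more fluent. A search that keeps one best prompt per weight value would retain a spread of prompts along the frontier rather than a single winner.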

Code Implementations: 1 repo