CL LG · Aug 20, 2023

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

arXiv:2308.10248v535.9693 citations · h-index: 9 · Has Code

Originality: Highly original

AI Analysis

The paper provides a lightweight, inference-time method for controlling high-level properties such as sentiment and topic in language models, enabling rapid iteration without machine optimization.

The paper addresses the problem of controlling language model outputs by introducing activation engineering, specifically the Activation Addition (ActAdd) technique, which modifies activations at inference time to steer outputs. The authors report state-of-the-art results on negative-to-positive sentiment shift and detoxification using models such as LLaMA-3 and OPT.

Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
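The steering-vector idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a toy residual stack in pure Python rather than a real language model, and every name here (`make_toy_model`, `embed`, `block`, `forward`, `steering_vector`, `D_MODEL`) is hypothetical. It assumes one activation vector per prompt, whereas a real transformer has one per token position. The sketch records the activations entering a chosen layer for a contrast pair such as "Love" and "Hate", scales their difference by a coefficient, and adds the result at the same layer during a later forward pass.

```python
import math
import random

D_MODEL = 8  # toy residual-stream width (assumption, not from the paper)

def make_toy_model(n_layers=4, seed=0):
    """Build a hypothetical stack of random weight matrices (one per layer)."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 0.3) for _ in range(D_MODEL)]
             for _ in range(D_MODEL)]
            for _ in range(n_layers)]

def embed(prompt):
    """Hypothetical deterministic embedding: bucket characters into a vector."""
    vec = [0.0] * D_MODEL
    for i, ch in enumerate(prompt):
        vec[(ord(ch) + i) % D_MODEL] += 1.0
    return vec

def block(weights, x):
    """Toy residual block: x + tanh(W x)."""
    return [xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
            for xi, row in zip(x, weights)]

def forward(model, x, steering=None, layer=None):
    """Run the stack; at `layer`, optionally add `steering`, then record
    the activations entering that layer. Returns (output, recorded)."""
    recorded = None
    for idx, weights in enumerate(model):
        if idx == layer:
            if steering is not None:
                x = [xi + si for xi, si in zip(x, steering)]
            recorded = list(x)
        x = block(weights, x)
    return x, recorded

def steering_vector(model, positive, negative, layer, coeff):
    """Contrast the activations of a prompt pair at `layer`, scaled by `coeff`."""
    _, h_pos = forward(model, embed(positive), layer=layer)
    _, h_neg = forward(model, embed(negative), layer=layer)
    return [coeff * (p - n) for p, n in zip(h_pos, h_neg)]

# Usage: steer a forward pass with the "Love" - "Hate" vector at layer 2.
model = make_toy_model()
vec = steering_vector(model, "Love", "Hate", layer=2, coeff=5.0)
plain_out, _ = forward(model, embed("I think you are"), layer=2)
steered_out, _ = forward(model, embed("I think you are"), steering=vec, layer=2)
```

No optimization step appears anywhere in the sketch: the vector comes from two extra forward passes, which is the sense in which the abstract calls the method lightweight. The layer index and coefficient are illustrative values chosen for the toy model.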
