LGCLSep 11, 2024

Representation Tuning

arXiv:2409.06927v42 citationsh-index: 3Has Code
Originality Incremental advance
AI Analysis

This work addresses safety concerns in AI by providing a more effective method for behavior control in LLMs, though it is incremental as it builds on existing activation engineering techniques.

The paper tackles the problem of controlling honesty in large language models by tuning activation vectors directly into the model, eliminating the need for online steering; the result shows that this approach yields a stronger effect and better generalization compared to online steering and standard fine-tuning.

Activation engineering is becoming increasingly popular as a means of online control of large language models (LLMs). In this work, we extend the idea of inference-time steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model, obviating the need for online control. First, we identify activation vectors related to honesty in an open-source LLM (Llama-2-13b-chat). Next, we demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, we show that a similar effect can be achieved by fine-tuning the vectors directly into the model, by use of a dual loss function based on the cosine similarity of residual stream activations to the vectors combined with a standard token-based loss ("representation tuning"). Finally, we compare the generations in response to honesty-probing prompts from the resulting models to those from models fine-tuned with a token-based loss alone, and to those from the untuned model subjected to online steering. Overall, fine-tuning the vectors into the models using the cosine similarity plus token loss showed a stronger effect than online steering, and generalized better than using the standard loss, suggesting the potential utility of this approach as a safety measure. Code and data are available at https://github.com/cma1114/representation_tuning. Tuned models are available at https://huggingface.co/collections/cackerman/representation-tuning-66da1e5ab41cd1b824687d9f.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes