CL CY LG · Dec 11, 2024

LatentQA: Teaching LLMs to Decode Activations Into Natural Language

Alexander Pan, Lijie Chen, Jacob Steinhardt

arXiv:2412.08686v114.435 citations · h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses interpretability challenges in AI by making model activations more human-readable, though it builds incrementally on existing instruction tuning methods.

The paper tackles the problem of interpreting language model activations by introducing LatentQA, the task of answering open-ended questions about activations in natural language. It proposes Latent Interpretation Tuning (LIT), which finetunes a decoder LLM for this task, enabling applications such as extracting relational knowledge, controlling model behavior, and revealing harmful capabilities.

Interpretability methods seek to understand language model representations, yet the outputs of most such methods -- circuits, vectors, scalars -- are not immediately human-interpretable. In response, we introduce LatentQA, the task of answering open-ended questions about model activations in natural language. Towards solving LatentQA, we propose Latent Interpretation Tuning (LIT), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs, similar to how visual instruction tuning trains on question-answer pairs associated with images. We use the decoder for diverse reading applications, such as extracting relational knowledge from representations or uncovering system prompts governing model behavior. Our decoder also specifies a differentiable loss that we use to control models, such as debiasing models on stereotyped sentences and controlling the sentiment of generations. Finally, we extend LatentQA to reveal harmful model capabilities, such as generating recipes for bioweapons and code for hacking.
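
The abstract's idea of a decoder that "specifies a differentiable loss" can be illustrated with a toy sketch. This is not the paper's implementation: a small linear map `W` (hypothetical) stands in for the decoder LLM, the two rows of `W` stand in for two candidate answers to a question about the activation, and the sketch steers a single activation vector directly by gradient descent, which is an assumption made only to keep the example self-contained. All names (`decoder_logits`, `answer_loss`, `steer`) are illustrative.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decoder_logits(W, activation):
    # Toy stand-in for the decoder: one logit per candidate answer.
    return [sum(w * a for w, a in zip(row, activation)) for row in W]

def answer_loss(W, activation, target):
    # Cross-entropy of the desired answer given the activation.
    probs = softmax(decoder_logits(W, activation))
    return -math.log(probs[target])

def loss_grad(W, activation, target):
    # Gradient of the loss w.r.t. the activation: W^T (probs - onehot(target)).
    probs = softmax(decoder_logits(W, activation))
    delta = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [sum(delta[i] * W[i][j] for i in range(len(W)))
            for j in range(len(activation))]

def steer(W, activation, target, lr=0.5, steps=50):
    # Plain gradient descent on the activation (an assumption of this sketch).
    act = list(activation)
    for _ in range(steps):
        grad = loss_grad(W, act, target)
        act = [a - lr * g for a, g in zip(act, grad)]
    return act

# Hypothetical setup: answer 0 = "negative sentiment", answer 1 = "positive sentiment".
W = [[1.0, 0.0, -1.0],
     [0.0, 1.0, 0.5]]
activation = [1.0, -0.5, 0.2]

before = answer_loss(W, activation, target=1)
steered = steer(W, activation, target=1)
after = answer_loss(W, steered, target=1)
print("loss before steering:", before)
print("loss after steering:", after)
```

Because the loss is differentiable in the activation, its gradient indicates how to change the representation so that the decoder's answer moves toward the desired one. A real system would replace the linear map with a finetuned decoder LLM and backpropagate through it.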
