ParaScopes: What do Language Models Activations Encode About Future Text?
This work addresses the need for better monitoring and understanding of language models' longer-term planning capabilities. The contribution is incremental: it builds on existing interpretability methods.
To study what language model activations encode about future text, the researchers developed Residual Stream Decoders that probe activations for paragraph-scale and document-scale plans. They found that information equivalent to 5+ tokens of future context can be decoded in small models.
Interpretability studies in language models often investigate forward-looking representations of activations. However, as language models become capable of tasks with ever longer time horizons, methods for understanding activations often remain limited to testing specific concepts or tokens. We develop a framework of Residual Stream Decoders as a method of probing model activations for paragraph-scale and document-scale plans. We test several methods and find that information equivalent to 5+ tokens of future context can be decoded in small models. These results lay the groundwork for better monitoring of language models and a better understanding of how they might encode longer-term planning information.
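To make the idea of probing activations for future-text information concrete, here is a minimal sketch of a generic linear probe. It is not the paper's Residual Stream Decoder. It assumes that each residual-stream activation is paired with a fixed-size embedding of the text that follows it, and it fits a ridge-regularized linear map from one to the other. The data is synthetic, and the names `fit_linear_probe` and `mean_cosine_similarity`, the dimensions, and the ridge strength are all illustrative assumptions.

```python
import numpy as np

def fit_linear_probe(activations, targets, ridge=1e-3):
    """Fit W minimizing ||activations @ W - targets||^2 + ridge * ||W||^2."""
    d_model = activations.shape[1]
    gram = activations.T @ activations + ridge * np.eye(d_model)
    return np.linalg.solve(gram, activations.T @ targets)

def mean_cosine_similarity(pred, true):
    """Average cosine similarity between matching rows of pred and true."""
    num = np.sum(pred * true, axis=1)
    denom = np.linalg.norm(pred, axis=1) * np.linalg.norm(true, axis=1)
    return float(np.mean(num / denom))

# Synthetic stand-ins (assumption): in a real experiment `acts` would be
# residual-stream activations captured from a model, and `future_embeds`
# would be embeddings of the paragraph that follows each position.
rng = np.random.default_rng(0)
n_train, n_test, d_model, d_embed = 512, 128, 64, 16
hidden_map = rng.normal(size=(d_model, d_embed))
acts = rng.normal(size=(n_train + n_test, d_model))
noise = 0.1 * rng.normal(size=(n_train + n_test, d_embed))
future_embeds = acts @ hidden_map + noise

# Fit on the training split, then score on held-out rows.
W = fit_linear_probe(acts[:n_train], future_embeds[:n_train])
score = mean_cosine_similarity(acts[n_train:] @ W, future_embeds[n_train:])
print(f"held-out mean cosine similarity: {score:.3f}")
```

The held-out score here is high only because the synthetic targets are a noisy linear function of the activations by construction. With real activations, the score on held-out text is the quantity of interest, and a baseline such as a probe given only a few tokens of actual future context would be needed to express it in "tokens of future context" terms.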