Line of Sight: On Linear Representations in VLLMs
This work offers insight into the internal representations of multimodal models, which could help researchers improve both their interpretability and their performance.
The researchers investigated how multimodal language models represent images in their hidden activations, finding that ImageNet classes are represented via linearly decodable features in LLaVA-NeXT and that these features become increasingly shared between text and image modalities in deeper layers.
Language models can be equipped with multimodal capabilities by fine-tuning on embeddings of visual inputs. But how do such multimodal models represent images in their hidden activations? We explore representations of image concepts within LLaVA-NeXT, a popular open-source VLLM. We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream. We show that the features are causal by performing targeted edits on the model output. To increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs), creating a highly interpretable dictionary of text and image features. We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.
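
To make "linearly decodable" and "targeted edits" concrete, the following is a minimal sketch on synthetic data, not the paper's method or code. It assumes toy 16-dimensional vectors stand in for residual-stream activations, plants a hypothetical class direction in them, fits a simple difference-of-means probe (one of several possible linear probes), and then shifts a single vector along the probe direction until the probe's decision flips. All names (`fit_probe`, `steer`, `make_data`) and constants are illustrative assumptions.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_probe(pos, neg):
    """Difference-of-means linear probe; the score of x is dot(w, x) + b."""
    mu_pos, mu_neg = mean(pos), mean(neg)
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    b = -dot(w, midpoint)  # decision boundary passes through the midpoint
    return w, b


def score(w, b, x):
    return dot(w, x) + b


def steer(x, w, b, margin=1.0):
    """Shift x along w by exactly the amount that makes its score equal `margin`."""
    step = (margin - score(w, b, x)) / dot(w, w)
    return [xi + step * wi for xi, wi in zip(x, w)]


def make_data(rng, n, dim, direction, offset):
    """Toy activations (assumption): unit Gaussian noise plus offset * direction."""
    return [[rng.gauss(0, 1) + offset * d for d in direction] for _ in range(n)]


rng = random.Random(0)
DIM = 16
# Hypothetical planted "class" direction: the first coordinate axis.
direction = [1.0 if i == 0 else 0.0 for i in range(DIM)]
pos = make_data(rng, 200, DIM, direction, +3.0)  # class present
neg = make_data(rng, 200, DIM, direction, -3.0)  # class absent

# Fit on the first half, evaluate on the held-out second half.
w, b = fit_probe(pos[:100], neg[:100])
held_out = [(x, True) for x in pos[100:]] + [(x, False) for x in neg[100:]]
accuracy = sum((score(w, b, x) > 0) == y for x, y in held_out) / len(held_out)

# Targeted edit: push one "class absent" vector across the probe boundary.
edited = steer(neg[150], w, b)

print(f"held-out probe accuracy: {accuracy:.3f}")
print(f"score before edit: {score(w, b, neg[150]):.3f}")
print(f"score after edit:  {score(w, b, edited):.3f}")
```

In a real model the edited vector would be written back into the forward pass and the effect measured on the generated output. Here the edit is only checked against the probe itself, which is a much weaker test of causality.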