CV · AI · Jun 5, 2025

Line of Sight: On Linear Representations in VLLMs

arXiv:2506.04706v1 · 6 citations · h-index: 20 · Has Code
Originality: Incremental advance
AI Analysis

This work offers insight into how multimodal models represent images internally, which could help researchers improve the interpretability and performance of such models.

The researchers investigated how multimodal language models represent images in their hidden activations. They found that ImageNet classes are represented via linearly decodable features in LLaVA-NeXT, and that these features become increasingly shared between the text and image modalities in deeper layers.

Language models can be equipped with multimodal capabilities by fine-tuning on embeddings of visual inputs. But how do such multimodal models represent images in their hidden activations? We explore representations of image concepts within LLaVA-NeXT, a popular open-source VLLM. We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream. We show that the features are causal by performing targeted edits on the model output. To increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs), creating a highly interpretable dictionary of text and image features. We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.
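To make "linearly decodable" concrete, here is a minimal sketch of a linear probe and a simple direction ablation. It is not the paper's code: the activations are synthetic stand-ins for residual-stream vectors, and every name (`make_synthetic_activations`, `fit_linear_probe`, `probe_predict`, `ablate_direction`) as well as the least-squares probe itself is an assumption made for illustration.

```python
import numpy as np


def make_synthetic_activations(n_per_class=200, d_model=64, n_classes=3, seed=0):
    """Hypothetical stand-in for residual-stream activations.

    Each class adds its own unit-norm direction (scaled by 4) to Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(n_classes, d_model))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    labels = np.repeat(np.arange(n_classes), n_per_class)
    acts = rng.normal(size=(labels.size, d_model)) + 4.0 * directions[labels]
    return acts, labels, directions


def fit_linear_probe(acts, labels, n_classes, ridge=1e-3):
    """Fit a ridge-regularised least-squares probe onto one-hot class targets."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])  # append a bias column
    Y = np.eye(n_classes)[labels]                       # one-hot targets
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)


def probe_predict(W, acts):
    """Return the class with the highest linear score for each activation."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    return (X @ W).argmax(axis=1)


def ablate_direction(acts, direction):
    """Project one direction out of every activation vector."""
    direction = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ direction, direction)


if __name__ == "__main__":
    acts, labels, directions = make_synthetic_activations()
    perm = np.random.default_rng(1).permutation(labels.size)
    train, test = perm[:450], perm[450:]

    W = fit_linear_probe(acts[train], labels[train], n_classes=3)
    clean_acc = (probe_predict(W, acts[test]) == labels[test]).mean()

    # Remove the class-0 direction and re-run the same probe.
    edited = ablate_direction(acts[test], directions[0])
    edited_acc = (probe_predict(W, edited) == labels[test]).mean()
    print(f"held-out accuracy: clean={clean_acc:.3f}, after ablation={edited_acc:.3f}")
```

In a real study the synthetic array would be replaced by activations captured from a chosen layer of the model, and the edit would be judged by its effect on the model's output rather than on the probe. High held-out accuracy from a purely linear map is what "linearly decodable" means here, and the ablation step is one simple way to test whether a direction matters for a downstream readout.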

