CV | AI | LG | Apr 3, 2025

Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

arXiv:2504.02821v2 | 43 citations | h-index: 18 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses AI safety by improving interpretability and control for users of VLMs, though it is incremental as it extends existing SAE methods from LLMs to VLMs.

The paper tackles interpretability and steerability in Vision-Language Models (VLMs) by applying Sparse Autoencoders (SAEs). It shows that SAEs significantly improve the monosemanticity of individual neurons in VLMs such as CLIP, and that SAE interventions on the vision encoder can directly steer the outputs of multimodal LLMs without modifying the underlying model.

Given that interpretability and steerability are crucial to AI safety, Sparse Autoencoders (SAEs) have emerged as a tool to enhance them in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity at the neuron-level in vision representations. To ensure that our evaluation aligns with human perception, we propose a benchmark derived from a large-scale user study. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons, with sparsity and wide latents being the most influential factors. Notably, we demonstrate that applying SAE interventions on CLIP's vision encoder directly steers multimodal LLM outputs (e.g., LLaVA), without any modifications to the underlying model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised tool for enhancing both interpretability and control of VLMs. Code is available at https://github.com/ExplainableML/sae-for-vlm.
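To make the two mechanisms named in the abstract concrete (a wide, sparse latent layer, and steering by intervening on a single latent), here is a minimal NumPy sketch. It is not the authors' implementation: the top-k sparsity rule, the layer sizes, the random untrained weights, and the helper names `encode`, `decode`, and `steer` are all assumptions made for illustration. The released code at the URL above is the authoritative reference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the latent layer is wider than the input ("wide latents"),
# and at most k latents are active per input ("sparsity").
d_model, d_latent, k = 16, 64, 4

# Randomly initialised, untrained parameters, for illustration only.
W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
b_dec = np.zeros(d_model)


def encode(x, k=k):
    """Map activations x of shape (batch, d_model) to sparse latents.

    Each row of the result has at most k non-zero entries.
    """
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)  # ReLU pre-activations
    # Indices of everything except the k largest entries in each row.
    drop = np.argpartition(pre, -k, axis=1)[:, :-k]
    z = pre.copy()
    np.put_along_axis(z, drop, 0.0, axis=1)
    return z


def decode(z):
    """Reconstruct activations from sparse latents."""
    return z @ W_dec + b_dec


def steer(x, latent_idx, value):
    """Clamp one latent to `value` and map back to activation space.

    Assumption of this sketch: the SAE reconstruction error is added back,
    so the output differs from x only along the chosen latent's decoder row.
    """
    z = encode(x)
    error = x - decode(z)
    z_steered = z.copy()
    z_steered[:, latent_idx] = value
    return decode(z_steered) + error


# Usage: stand-in "vision encoder activations" for a batch of 3 inputs.
x = rng.normal(size=(3, d_model))
x_steered = steer(x, latent_idx=5, value=10.0)
```

In a setup like the one the abstract describes, `x` would be activations from CLIP's vision encoder and `x_steered` would be passed on to the multimodal LLM in place of the original activations; that wiring is omitted here because it depends on the specific model code.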

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
