VisualScratchpad: Inference-time Visual Concepts Analysis in Vision Language Models
This work provides a tool for systematically debugging and explaining failure modes in high-performing vision language models, aimed at researchers and developers working on model interpretability.
This paper introduces VisualScratchpad, an interactive interface that applies sparse autoencoders to the vision encoder and links the resulting visual concepts to text tokens via text-to-image attention. The interface shows which visual concepts a vision language model captures and which it actually uses during inference. Case studies reveal three failure modes: limited cross-modal alignment, misleading visual concepts, and unused hidden cues.
High-performing vision language models still produce incorrect answers, yet their failure modes are often difficult to explain. To make model internals more accessible and enable systematic debugging, we introduce VisualScratchpad, an interactive interface for visual concept analysis during inference. We apply sparse autoencoders to the vision encoder and link the resulting visual concepts to text tokens via text-to-image attention, allowing us to examine which visual concepts are both captured by the vision encoder and utilized by the language model. VisualScratchpad also provides a token-latent heatmap view that suggests a sufficient set of latents for effective concept ablation in causal analysis. Through case studies, we reveal three underexplored failure modes: limited cross-modal alignment, misleading visual concepts, and unused hidden cues. Project page: https://hyesulim.github.io/visual_scratchpad_projectpage/
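As a rough illustration of the linking step described above, the sketch below scores each sparse-autoencoder latent for each text token by weighting per-patch latent activations with that token's text-to-image attention. This is a minimal sketch under stated assumptions, not the authors' implementation: the ReLU encoder form, the attention-weighted sum as the aggregation rule, the random toy inputs, and all function names (`sae_encode`, `token_latent_heatmap`, `top_latents`) are hypothetical.

```python
import math
import random


def sae_encode(patch_feats, w_enc, b_enc):
    """Assumed ReLU SAE encoder: (P x D) patch features -> (P x K) latents."""
    return [
        [max(0.0, sum(x * w_enc[d][k] for d, x in enumerate(feat)) + b_enc[k])
         for k in range(len(b_enc))]
        for feat in patch_feats
    ]


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def token_latent_heatmap(attn, latents):
    """Attention-weighted latent activation per text token.

    attn: (T x P) text-to-image attention, latents: (P x K) -> (T x K).
    The weighted sum is an assumed aggregation rule for illustration.
    """
    n_latents = len(latents[0])
    return [
        [sum(a * latents[p][k] for p, a in enumerate(row)) for k in range(n_latents)]
        for row in attn
    ]


def top_latents(heatmap, token_idx, k=3):
    """Indices of the k highest-scoring latents for one text token."""
    row = heatmap[token_idx]
    return sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]


# Toy stand-ins for real model outputs (all values are synthetic).
random.seed(0)
P, D, K, T = 16, 8, 32, 4  # patches, feature dim, latents, text tokens
patch_feats = [[random.gauss(0, 1) for _ in range(D)] for _ in range(P)]
w_enc = [[random.gauss(0, 1) for _ in range(K)] for _ in range(D)]
b_enc = [0.0] * K
attn = [softmax([random.gauss(0, 1) for _ in range(P)]) for _ in range(T)]

latents = sae_encode(patch_feats, w_enc, b_enc)
heatmap = token_latent_heatmap(attn, latents)
print(top_latents(heatmap, token_idx=0))
```

In a causal analysis, the latents returned by `top_latents` for a token of interest would be natural candidates for ablation; how VisualScratchpad actually selects its sufficient set is not specified in the abstract.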