CV · AI · LG · NC · Mar 14, 2024

Can We Talk Models Into Seeing the World Differently?

arXiv:2403.09193v2 · 21 citations · Has Code · ICLR
Originality: Incremental advance
AI Analysis

This work examines how biases in multi-modal AI models can be understood and controlled, a question relevant to both researchers and practitioners. The advance is incremental: it builds on biases already documented in uni-modal studies.

The paper investigates how biases from the vision and language components interact in vision-language models (VLMs). It finds that VLMs inherit the texture vs. shape bias from their vision encoders, but that multi-modal training alters how visual cues are processed. Steering through language prompts works better toward texture-based decisions than toward shape-based ones.

Unlike traditional vision-only models, vision language models (VLMs) offer an intuitive way to access visual content through language prompting by combining a large language model (LLM) with a vision encoder. However, both the LLM and the vision encoder come with their own set of biases, cue preferences, and shortcuts, which have been rigorously studied in uni-modal models. A timely question is how such (potentially misaligned) biases and cue preferences behave under multi-modal fusion in VLMs. As a first step towards a better understanding, we investigate a particularly well-studied vision-only bias: the texture vs. shape bias and the dominance of local over global information. As expected, we find that VLMs inherit this bias to some extent from their vision encoders. Surprisingly, multi-modality alone proves to have important effects on model behavior, i.e., the joint training and the language querying change the way visual cues are processed. While this direct impact of language-informed training on a model's visual perception is intriguing, it raises further questions about our ability to actively steer a model's output so that its prediction is based on particular visual cues of the user's choice. Interestingly, VLMs have an inherent tendency to recognize objects based on shape information, which is different from what a plain vision encoder would do. Further active steering towards shape-based classifications through language prompts is, however, limited. In contrast, active VLM steering towards texture-based decisions through simple natural language prompts is often more successful.

URL: https://github.com/paulgavrikov/vlm_shapebias
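To make the texture vs. shape bias concrete, the sketch below computes the shape-bias score as it is commonly defined in the cue-conflict literature: among images whose shape and texture point to different classes, the fraction of decisions matching the shape class out of all decisions matching either cue. The abstract does not state the exact metric, so treating this as the paper's measure is an assumption; the function name `shape_bias` and the record layout are hypothetical and not taken from the linked repository.

```python
from typing import Iterable, Tuple

# Each record is (predicted_label, shape_label, texture_label) for one
# cue-conflict image. This layout is an assumption made for illustration.
Record = Tuple[str, str, str]


def shape_bias(records: Iterable[Record]) -> float:
    """Return shape decisions / (shape decisions + texture decisions)."""
    shape_hits = 0
    texture_hits = 0
    for predicted, shape_label, texture_label in records:
        if shape_label == texture_label:
            # Not a cue conflict: the prediction cannot reveal a preference.
            continue
        if predicted == shape_label:
            shape_hits += 1
        elif predicted == texture_label:
            texture_hits += 1
        # Predictions matching neither cue are ignored.
    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no prediction matched either the shape or texture label")
    return shape_hits / decided


# Toy data: a cat-shaped image rendered with elephant texture, four answers.
example = [
    ("cat", "cat", "elephant"),       # shape decision
    ("elephant", "cat", "elephant"),  # texture decision
    ("cat", "cat", "elephant"),       # shape decision
    ("dog", "cat", "elephant"),       # matches neither cue, ignored
]
print(shape_bias(example))
```

A score of 1.0 would mean every cue-matching decision followed shape, and 0.0 would mean every one followed texture. Under this definition, steering a VLM by prompt would show up as a shift in the score computed over the same images with a different prompt.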

Code Implementations: 1 repo
