CV, AI, CL, LG | Jul 23, 2024

Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models

arXiv:2407.16526v1 | 1 citation | h-index: 21 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses sub-optimal performance in VLMs for tasks like visual question answering and image captioning, offering an incremental improvement by fine-tuning vision encoders more effectively.

The paper targets image understanding errors in vision-language models (VLMs) that stem from frozen vision encoders such as CLIP. It proposes an efficient and robust tuning method that selectively updates the encoder, substantially improving performance on data the model previously got wrong while maintaining overall robustness.

Vision language models (VLMs) demonstrate impressive capabilities in visual question answering and image captioning, acting as a crucial link between visual and language models. However, existing open-source VLMs heavily rely on pretrained and frozen vision encoders (such as CLIP). Despite CLIP's robustness across diverse domains, it still exhibits non-negligible image understanding errors. These errors propagate to the VLM responses, resulting in sub-optimal performance. In our work, we propose an efficient and robust method for updating vision encoders within VLMs. Our approach selectively and locally updates encoders, leading to substantial performance improvements on data where previous mistakes occurred, while maintaining overall robustness. Furthermore, we demonstrate the effectiveness of our method during continual few-shot updates. Theoretical grounding, generality, and computational efficiency characterize our approach.
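
The abstract says the encoder is updated "selectively and locally" but does not give the selection rule. The sketch below only illustrates the general idea of a selective update. It assumes a gradient-magnitude criterion, which is not taken from the abstract, and it uses a flat list of floats in place of real encoder weights. The names `selective_update`, `k`, and `lr` are hypothetical.

```python
def selective_update(params, grads, k, lr=0.1):
    """Apply a gradient step to only the k parameters with the largest |gradient|.

    Illustrative only: the top-k-by-gradient rule is an assumption,
    not the selection criterion described in the paper.
    """
    # Rank parameter indices by absolute gradient, largest first.
    ranked = sorted(range(len(params)), key=lambda i: abs(grads[i]), reverse=True)
    chosen = set(ranked[:k])
    # Step the chosen parameters; every other parameter stays frozen.
    updated = [
        p - lr * g if i in chosen else p
        for i, (p, g) in enumerate(zip(params, grads))
    ]
    return updated, sorted(chosen)


# Toy stand-ins for encoder weights and their gradients on misclassified samples.
params = [1.0, 2.0, 3.0, 4.0]
grads = [0.5, -2.0, 0.1, 1.0]
new_params, touched = selective_update(params, grads, k=2)
```

Here only the parameters at indices 1 and 3 change; the other two keep their pretrained values. Leaving most weights untouched in this way is what lets a local update correct specific errors without disturbing behaviour elsewhere.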

