CV, CL | May 30, 2023

Scalable Performance Analysis for Vision-Language Models

arXiv:2305.18786v2221 citations | Has Code
Originality: Incremental advance
AI Analysis

This paper gives researchers a scalable tool for analyzing semantic errors in vision-language models, though the advance is incremental because it builds on prior probing benchmarks.

The paper tackles the problem of understanding limitations in vision-language models by introducing a scalable analysis method using existing annotated benchmarks, revealing that CLIP behaves like a bag-of-words model and gets confused by concrete words.

Joint vision-language models have shown great performance over a diverse set of tasks. However, little is known about their limitations, as the high dimensional space learned by these models makes it difficult to identify semantic errors. Recent work has addressed this problem by designing highly controlled probing task benchmarks. Our paper introduces a more scalable solution that relies on already annotated benchmarks. Our method consists of extracting a large set of diverse features from a vision-language benchmark and measuring their correlation with the output of the target model. We confirm previous findings that CLIP behaves like a bag of words model and performs better with nouns and verbs; we also uncover novel insights such as CLIP getting confused by concrete words. Our framework is available at https://github.com/MichiganNLP/Scalable-VLM-Probing and can be used with other multimodal models and benchmarks.
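The feature-correlation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). It assumes a plain Pearson correlation as the statistic, and the feature names (`num_tokens`, `has_negation`, `avg_token_length`), the captions, and the scores below are all hypothetical stand-ins for benchmark annotations and real model outputs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant feature carries no signal; report zero correlation.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def extract_features(caption):
    """Hypothetical toy features; assumes a non-empty caption."""
    tokens = caption.lower().split()
    return {
        "num_tokens": float(len(tokens)),
        "has_negation": float(any(t in {"not", "no"} for t in tokens)),
        "avg_token_length": sum(len(t) for t in tokens) / len(tokens),
    }


def rank_features(captions, scores):
    """Correlate each feature with the model scores, strongest first."""
    rows = [extract_features(c) for c in captions]
    names = list(rows[0].keys())
    corr = {name: pearson([r[name] for r in rows], scores) for name in names}
    return sorted(corr.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Made-up captions and made-up image-text scores, for illustration only.
toy_captions = [
    "a dog runs on the beach",
    "no dog is on the beach",
    "a child holds a red balloon",
    "the balloon is not red",
]
toy_scores = [0.31, 0.29, 0.34, 0.27]

ranking = rank_features(toy_captions, toy_scores)
for name, value in ranking:
    print(f"{name}: {value:+.3f}")
```

A real analysis would replace `toy_scores` with the target model's outputs on an annotated benchmark and would use a much larger, more diverse feature set, as the abstract describes.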

Code Implementations: 1 repo