CVLGDec 30, 2024

Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks

arXiv:2412.20682v1h-index: 5
Originality Incremental advance
AI Analysis

This addresses the challenge of model selection in resource-constrained settings, offering an incremental improvement over existing methods by eliminating the need for supervised data or large language models.

The paper tackles the problem of selecting the best pre-trained vision-language model for downstream tasks without labeled data, proposing VEGA, which measures alignment between visual and textual graphs to achieve reliable performance estimates across various benchmarks.

Vision language models (VLMs) like CLIP show stellar zero-shot capability on classification benchmarks. However, selecting the VLM with the highest performance on the unlabeled downstream task is non-trivial. Existing VLM selection methods focus on the class-name-only setting, relying on a supervised large-scale dataset and large language models, which may not be accessible or feasible during deployment. This paper introduces the problem of \textbf{unsupervised vision-language model selection}, where only unsupervised downstream datasets are available, with no additional information provided. To solve this problem, we propose a method termed Visual-tExtual Graph Alignment (VEGA), to select VLMs without any annotations by measuring the alignment of the VLM between the two modalities on the downstream task. VEGA is motivated by the pretraining paradigm of VLMs, which aligns features with the same semantics from the visual and textual modalities, thereby mapping both modalities into a shared representation space. Specifically, we first construct two graphs on the vision and textual features, respectively. VEGA is then defined as the overall similarity between the visual and textual graphs at both node and edge levels. Extensive experiments across three different benchmarks, covering a variety of application scenarios and downstream datasets, demonstrate that VEGA consistently provides reliable and accurate estimates of VLMs' performance on unlabeled downstream tasks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes