CV · Jul 13, 2023

Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks

arXiv:2307.06795v1 · 3 citations · h-index: 38 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses a specific limitation of vision-language foundation models on fine-grained computer vision tasks, which is relevant to both researchers and practitioners. The advance appears incremental, since the method builds directly on the existing CLIP architecture.

The paper tackles the problem of vision-language foundation models struggling with fine-grained downstream tasks, such as attribute detection and localization, by proposing a multitask fine-tuning strategy based on positive/negative prompts. The results show strong improvements on fine-grained bird attribute detection and localization, along with higher classification performance on the CUB200-2011 dataset.

Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets, especially thanks to their free-text inputs. However, they struggle to handle some downstream tasks, such as fine-grained attribute detection and localization. In this paper, we propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to further leverage the capacities of the vision-language foundation models. Using the CLIP architecture as baseline, we show strong improvements on bird fine-grained attribute detection and localization tasks, while also increasing the classification performance on the CUB200-2011 dataset. We provide source code for reproducibility purposes: it is available at https://github.com/FactoDeepLearning/MultitaskVLFM.
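To make the positive/negative prompt idea in the abstract more concrete, here is a minimal sketch of how one attribute could be scored by comparing an image embedding against a positive and a negative text prompt. It is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the function names `cosine` and `attribute_probability`, the toy three-dimensional embeddings, and the temperature value are all hypothetical, and the paper's actual formulation and losses may differ. In practice the embeddings would come from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attribute_probability(image_emb, pos_emb, neg_emb, temperature=0.07):
    """Probability that an attribute is present, from a two-way softmax
    over the similarities to the positive and negative prompt embeddings.

    Assumption: the temperature of 0.07 is only an illustrative default.
    """
    s_pos = cosine(image_emb, pos_emb) / temperature
    s_neg = cosine(image_emb, neg_emb) / temperature
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(s_pos, s_neg)
    e_pos = math.exp(s_pos - m)
    e_neg = math.exp(s_neg - m)
    return e_pos / (e_pos + e_neg)


# Hypothetical prompt pair for one attribute (wording is an assumption).
positive_prompt = "a photo of a bird with a red crown"
negative_prompt = "a photo of a bird without a red crown"

# Toy stand-ins for encoder outputs; real embeddings would come from CLIP.
image_emb = [0.9, 0.1, 0.0]
pos_emb = [1.0, 0.0, 0.0]
neg_emb = [0.0, 1.0, 0.0]

p_present = attribute_probability(image_emb, pos_emb, neg_emb)
print(p_present > 0.5)
```

Scoring each attribute against its own prompt pair turns attribute detection into a set of independent binary decisions, which is one plausible reading of a positive/negative prompt formulation. Applying the same comparison to patch-level rather than image-level embeddings would be one way to approach localization, though that is again an assumption rather than something the abstract states.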

Code Implementations: 1 repo