CV | Jul 5, 2024

AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

arXiv:2407.04603v2 | 34 citations | h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficiently transferring vision-language models to new domains and offers a practical solution for applications with limited data, though the advance is incremental because it builds on existing adaptation methods.

The paper tackles the problem of adapting pre-trained vision-language models to new concepts when little information about the new classes is available. It introduces the AWT framework, which enhances zero-shot capabilities without additional training and reports state-of-the-art results on tasks such as image classification and video action recognition.

Pre-trained vision-language models (VLMs) have shown impressive results in various visual classification tasks. However, we often fail to fully unleash their potential when adapting them for new concept understanding due to limited information on new classes. To address this limitation, we introduce a novel adaptation framework, AWT (Augment, Weight, then Transport). AWT comprises three key components: augmenting inputs with diverse visual perspectives and enriched class descriptions through image transformations and language models; dynamically weighting inputs based on the prediction entropy; and employing optimal transport to mine semantic correlations in the vision-language space. AWT can be seamlessly integrated into various VLMs, enhancing their zero-shot capabilities without additional training and facilitating few-shot learning through an integrated multimodal adapter module. We verify AWT in multiple challenging scenarios, including zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization. AWT consistently outperforms the state-of-the-art methods in each setting. In addition, our extensive studies further demonstrate AWT's effectiveness and adaptability across different VLMs, architectures, and scales.
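
The weighting and transport steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the cosine cost, the Sinkhorn solver, the uniform weights on class descriptions, the mean-embedding class prototypes, and the random toy features standing in for augmented image views and LLM-generated descriptions are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def entropy_weights(logits, temperature=1.0):
    """Map per-view class logits (n_views, n_classes) to weights that sum to 1.
    Views with lower prediction entropy receive larger weights."""
    p = softmax(logits, axis=-1)
    h = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return softmax(-h / temperature, axis=0)

def sinkhorn(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularised transport plan between marginals a (n,) and b (m,)."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def ot_distance(view_feats, desc_feats, a, b, reg=0.1):
    """Transport cost between weighted image views and class descriptions,
    using cosine distance as the ground cost (an assumption)."""
    x = view_feats / np.linalg.norm(view_feats, axis=1, keepdims=True)
    y = desc_feats / np.linalg.norm(desc_feats, axis=1, keepdims=True)
    cost = 1.0 - x @ y.T
    plan = sinkhorn(cost, a, b, reg)
    return float((plan * cost).sum())

# Toy data: random vectors stand in for embeddings of 4 augmented views
# and 3 descriptions per class (hypothetical classes "cat" and "dog").
rng = np.random.default_rng(0)
views = rng.normal(size=(4, 8))
descs = {"cat": rng.normal(size=(3, 8)), "dog": rng.normal(size=(3, 8))}

# Hypothetical per-view logits against mean description embeddings.
class_protos = np.stack([d.mean(axis=0) for d in descs.values()])
view_weights = entropy_weights(views @ class_protos.T)

# Score each class by negative transport distance; uniform description weights.
scores = {
    name: -ot_distance(views, d, view_weights, np.full(len(d), 1.0 / len(d)))
    for name, d in descs.items()
}
prediction = max(scores, key=scores.get)
```

In this sketch the entropy-derived weights act as the source marginal of the transport problem, so confident views carry more mass when matched against a class's descriptions. In practice the features would come from a VLM's image and text encoders rather than a random generator.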

Code Implementations: 1 repo
