CV | LG | Apr 19, 2024

Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models

arXiv:2404.12588v1 | 7 citations | h-index: 7 | ICME
Originality: Incremental advance
AI Analysis

This work addresses the problem of resource-efficient adaptation for vision-language models, offering an incremental improvement over existing adapter methods by better leveraging cross-modal cues.

The paper tackles parameter-efficient transfer learning in vision-language models by introducing XMAdapter, which builds cache models for both the text and image modalities and fuses their retrieval affinities to gather cues for inference. The authors report that it outperforms previous adapter-based methods in accuracy, generalization, and efficiency on benchmark datasets.

Adapter-based parameter-efficient transfer learning has achieved exciting results in vision-language models. Traditional adapter methods often require training or fine-tuning, and so face challenges such as insufficient samples or limited resources. While some methods avoid training by leveraging an image-modality cache and retrieval, they overlook the importance of the text modality and of cross-modal cues for the efficient adaptation of parameters in vision-language models. This work introduces a cross-modal parameter-efficient approach named XMAdapter. XMAdapter establishes cache models for both the text and image modalities. It then retrieves over this bimodal vision-language information to gather clues for inference. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion, decoupling the similarities of the different modalities to assess their respective contributions. Additionally, it identifies hard samples based on differences in cross-modal affinity and enhances model performance by adaptively adjusting the learning intensity of each sample. Extensive experimental results on benchmark datasets demonstrate that XMAdapter significantly outperforms previous adapter-based methods in accuracy, generalization, and efficiency.
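
The abstract gives no formulas, so the following is only a minimal sketch of the general idea it describes: two caches (image and text keys), one affinity per modality, and a ratio that blends them before the cached labels are aggregated. Every name here (`fused_cache_logits`, `hard_sample_weights`, `ratio`, `beta`) is hypothetical. The exponential sharpening and the gap-based weighting are assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products act as cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fused_cache_logits(query, image_keys, text_keys, labels_onehot,
                       ratio=0.5, beta=5.0):
    """Blend image-side and text-side affinities, then aggregate cached labels.

    query:         (d,)   feature of the test image
    image_keys:    (n, d) cached image features (assumed layout)
    text_keys:     (n, d) cached text features, one per cached sample (assumed)
    labels_onehot: (n, c) one-hot labels of the cached samples
    ratio:         weight of the image affinity; 1 - ratio goes to the text side
    beta:          sharpness of the affinity-to-weight mapping (assumed form)
    """
    q = l2_normalize(query)
    img_aff = l2_normalize(image_keys) @ q   # (n,) image-modality affinity
    txt_aff = l2_normalize(text_keys) @ q    # (n,) text-modality affinity
    fused = ratio * img_aff + (1.0 - ratio) * txt_aff
    weights = np.exp(-beta * (1.0 - fused))  # larger affinity -> larger weight
    return weights @ labels_onehot, img_aff, txt_aff


def hard_sample_weights(img_aff, txt_aff):
    """Hypothetical per-sample weight: a larger cross-modal gap counts as harder."""
    gap = np.abs(img_aff - txt_aff)
    return 1.0 + gap / (gap.max() + 1e-12)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, c = 8, 16, 4
    image_keys = rng.normal(size=(n, d))
    text_keys = rng.normal(size=(n, d))
    labels = np.eye(c)[rng.integers(0, c, size=n)]
    query = rng.normal(size=d)

    logits, img_aff, txt_aff = fused_cache_logits(query, image_keys, text_keys, labels)
    print("predicted class:", int(np.argmax(logits)))
    print("sample weights:", hard_sample_weights(img_aff, txt_aff))
```

Keeping the two affinities separate until the final blend is what lets the contribution of each modality be inspected or reweighted independently. In this sketch `ratio` is a fixed scalar, whereas the abstract describes the affinity ratio as being adjusted dynamically.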
