CV, AI, LG, MM | Sep 4, 2023

MultiWay-Adapter: Adapting large-scale multi-modal models for scalable image-text retrieval

arXiv:2309.01516v3 | 11 citations
Originality: Incremental advance
AI Analysis

This paper addresses the computational and memory demands of fine-tuning large multimodal models for scalable image-text retrieval, offering an incremental improvement over existing efficient adaptation methods.

The paper tackles the challenge of adapting large multimodal models to specialized tasks efficiently by introducing MultiWay-Adapter (MWA), which improves inter-modal alignment and reduces training time by up to 57% while adding only 2-3% more parameters.

As Multimodal Large Language Models (MLLMs) grow in size, adapting them to specialized tasks becomes increasingly challenging due to high computational and memory demands. Traditional fine-tuning methods are costly because they require extensive, task-specific training. Efficient adaptation methods exist that aim to reduce these costs, but in practice they suffer from shallow inter-modal alignment, which severely hurts model effectiveness. To tackle these computational challenges and improve inter-modal alignment, we introduce the MultiWay-Adapter (MWA), a novel framework featuring an 'Alignment Enhancer'. This enhancer deepens inter-modal alignment, enabling high transferability with minimal tuning effort. Our experiments show that, unlike prior efficient tuning approaches, MWA maintains model effectiveness while reducing training time by up to 57%. MWA is also lightweight, increasing model size by only 2-3% (in terms of parameters) for state-of-the-art foundation models like BEiT-3 Large. These results demonstrate that MWA provides an efficient and effective adaptation method for MLLMs, significantly broadening their applicability.
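To give a feel for why an adapter can add only a few percent of parameters, the following is a minimal sketch of the arithmetic for a generic bottleneck adapter (a down-projection followed by an up-projection) attached to a standard Transformer layer. It is not the MWA architecture from the paper. The hidden size, bottleneck width, number of adapters per layer, and all function names are hypothetical values chosen for illustration, and layer norms and embeddings are ignored.

```python
# Illustrative parameter-overhead arithmetic for a GENERIC bottleneck adapter.
# Assumption: this is NOT the paper's MWA design; all sizes are hypothetical.

def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer mapping d_in -> d_out."""
    return d_in * d_out + (d_out if bias else 0)


def bottleneck_adapter_params(hidden: int, bottleneck: int) -> int:
    """Down-projection (hidden -> bottleneck) plus up-projection (bottleneck -> hidden)."""
    return linear_params(hidden, bottleneck) + linear_params(bottleneck, hidden)


def transformer_layer_params(hidden: int, ffn_mult: int = 4) -> int:
    """Attention (Q, K, V, output projections) plus a two-layer feed-forward block."""
    attention = 4 * linear_params(hidden, hidden)
    ffn = linear_params(hidden, ffn_mult * hidden) + linear_params(ffn_mult * hidden, hidden)
    return attention + ffn


def adapter_overhead(hidden: int, bottleneck: int, adapters_per_layer: int = 1) -> float:
    """Fraction of extra parameters the adapters add to one Transformer layer."""
    base = transformer_layer_params(hidden)
    added = adapters_per_layer * bottleneck_adapter_params(hidden, bottleneck)
    return added / base


# Hypothetical configuration: hidden size 1024, bottleneck 64, two adapters per layer.
overhead = adapter_overhead(hidden=1024, bottleneck=64, adapters_per_layer=2)
print(f"Adapter overhead per layer: {overhead:.2%}")
```

Because the adapter cost grows with hidden × bottleneck while the layer cost grows with hidden², a narrow bottleneck keeps the overhead to a small fraction of the layer, which is the general property that lightweight adapter methods rely on.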

Code Implementations: 1 repo
