CV · AI · Oct 30, 2023

MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval

arXiv:2310.19654v3 · 30 citations · h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient image-text retrieval on mobile devices, offering a practical solution for industry deployment, though it is incremental as it builds on existing model structures.

The paper tackles the challenge of deploying large visual-language pretraining models on mobile devices by proposing MCAD, a distillation technique that combines the strengths of single- and dual-stream models to enhance retrieval performance without increasing inference complexity, achieving ~100MB memory usage and ~8.0ms latency on mobile chips.

Due to the success of large-scale visual-language pretraining (VLP) models and the widespread use of image-text retrieval in industry, it is now critically necessary to reduce model size and streamline mobile-device deployment. Single- and dual-stream model structures are commonly used in image-text retrieval with the goal of closing the semantic gap between the textual and visual modalities. While single-stream models use deep feature fusion to achieve more accurate cross-modal alignment, dual-stream models are better suited to offline indexing and fast inference. We propose a Multi-teacher Cross-modality Alignment Distillation (MCAD) technique to integrate the advantages of single- and dual-stream models. By incorporating the fused single-stream features into the image and text features of the dual-stream model, we formulate new modified teacher similarity distributions and features. Then, we conduct both distribution and feature distillation to boost the capability of the student dual-stream model, achieving high retrieval performance without increasing inference complexity. Extensive experiments demonstrate the remarkable performance and high efficiency of MCAD on image-text retrieval tasks. Furthermore, we implement a lightweight CLIP model on Snapdragon/Dimensity chips with only $\sim$100M running memory and $\sim$8.0ms search latency, bringing VLP models to mobile-device applications.
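The two distillation signals named in the abstract (distribution and feature distillation) can be illustrated with a minimal NumPy sketch. The abstract gives no formulas, so everything here is an assumption for illustration rather than the paper's method: the teacher similarity matrix and teacher features are taken as given inputs (how MCAD builds them from the fused single-stream features is not shown), the distribution loss is a plain KL divergence over softmax-normalized image-to-text similarities, the feature loss is a mean squared error between L2-normalized features, and the function names and the temperature `tau` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def similarity(img_feat, txt_feat):
    """Cosine similarity matrix: rows are images, columns are texts."""
    return l2_normalize(img_feat) @ l2_normalize(txt_feat).T


def distribution_distillation_loss(teacher_sim, student_sim, tau=1.0):
    """Assumed form: KL(teacher || student) per image row, averaged over the batch."""
    log_p = log_softmax(teacher_sim / tau, axis=1)
    log_q = log_softmax(student_sim / tau, axis=1)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=1)))


def feature_distillation_loss(teacher_feat, student_feat):
    """Assumed form: MSE between L2-normalized features of equal dimension."""
    diff = l2_normalize(teacher_feat) - l2_normalize(student_feat)
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical batch of 4 image-text pairs with 8-dimensional features.
    teacher_img, teacher_txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
    student_img, student_txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))

    dist_loss = distribution_distillation_loss(
        similarity(teacher_img, teacher_txt), similarity(student_img, student_txt)
    )
    feat_loss = feature_distillation_loss(teacher_img, student_img) + \
        feature_distillation_loss(teacher_txt, student_txt)
    print("distribution loss:", dist_loss, "feature loss:", feat_loss)
```

In this sketch only the training losses touch the teacher. At inference the student remains an ordinary dual-stream model, so image embeddings can still be indexed offline and queried with a dot product, which is consistent with the abstract's claim that inference complexity does not increase.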
