CV | AI | May 23, 2024

Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models

arXiv:2405.14715v3 | 2 citations | h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses a practical problem in cross-modal retrieval systems: upgrading to a new model without backfilling the stored embeddings. The contribution is incremental, since it extends existing backward-compatible training from vision-only to cross-modal settings.

The paper tackles the problem of upgrading vision-language models for cross-modal retrieval without costly backfilling. It proposes Cross-modal Backward-compatible Training (XBT), which uses a projection module pretrained with text data to align the new model's embeddings with the old model's. This reduces the number of image-text pairs required by up to 90% while achieving competitive retrieval performance.

Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and, once it is pretrained, it avoids using the old model during training. Furthermore, we utilize parameter-efficient training strategies that improve efficiency and preserve the off-the-shelf new model's knowledge by avoiding any modifications. Experimental results on cross-modal retrieval datasets demonstrate the effectiveness of XBT and its potential to enable backfill-free upgrades when a new VLP model emerges.
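The core idea of the abstract, a projection that maps the new model's embeddings into the old model's space so that the stored gallery never has to be re-embedded, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's method: the two "encoders" are hypothetical random linear maps (`toy_encoder`), the projection is a closed-form ridge regression rather than a trained module, image embeddings are modelled as text embeddings plus small noise, and all dimensions and sample counts are arbitrary.

```python
import math
import random

rng = random.Random(0)
D_CONCEPT, D_OLD, D_NEW = 4, 4, 6  # toy sizes, not taken from the paper

def toy_encoder(rows, cols):
    """Hypothetical linear 'encoder': identity-like matrix plus small random mixing."""
    return [[(1.0 if i % cols == j else 0.0) + 0.1 * rng.gauss(0, 1)
             for j in range(cols)] for i in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve(a, b):
    """Solve a @ x = b (b is an n x k matrix) by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [a[i][:] + b[i][:] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [[aug[i][n + j] / aug[i][i] for j in range(len(b[0]))] for i in range(n)]

def fit_projection(new_emb, old_emb, ridge=1e-3):
    """Ridge regression: find W (d_old x d_new) such that W @ new is close to old."""
    d_new, d_old = len(new_emb[0]), len(old_emb[0])
    gram = [[sum(x[i] * x[j] for x in new_emb) + (ridge if i == j else 0.0)
             for j in range(d_new)] for i in range(d_new)]
    cross = [[sum(x[i] * y[j] for x, y in zip(new_emb, old_emb))
              for j in range(d_old)] for i in range(d_new)]
    w_t = solve(gram, cross)                 # shape: d_new x d_old
    return [list(col) for col in zip(*w_t)]  # transpose to d_old x d_new

def concepts(n):
    """Latent 'meanings' shared by a caption and its image in this toy world."""
    return [[rng.gauss(0, 1) for _ in range(D_CONCEPT)] for _ in range(n)]

old_enc = toy_encoder(D_OLD, D_CONCEPT)
new_enc = toy_encoder(D_NEW, D_CONCEPT)

# 1) Fit the projection from text only: the same captions embedded by both models.
train = concepts(200)
W = fit_projection([matvec(new_enc, z) for z in train],
                   [matvec(old_enc, z) for z in train])

# 2) Gallery: captions embedded once by the OLD model and never recomputed.
items = concepts(20)
gallery = [matvec(old_enc, z) for z in items]

# 3) Queries: "images" embedded by the NEW model (assumed: text embedding plus noise).
queries = [[v + 0.01 * rng.gauss(0, 1) for v in matvec(new_enc, z)] for z in items]

hits = 0
for idx, q in enumerate(queries):
    projected = matvec(W, q)  # new-model embedding mapped into the old space
    best = max(range(len(gallery)), key=lambda g: cosine(projected, gallery[g]))
    hits += int(best == idx)
recall_at_1 = hits / len(queries)
print(f"image-to-text recall@1 via projection: {recall_at_1:.2f}")
```

Without the projection, the 6-dimensional queries could not even be compared with the 4-dimensional gallery, which is the incompatibility that would otherwise force backfilling. Note that the old encoder is used only while fitting `W`; at query time only the new encoder and the projection are needed.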
