CL · Jan 2, 2024

DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever

arXiv:2401.01076v2 · 4 citations · h-index: 23 · ICASSP
AI Analysis

This work aims to improve retrieval in multi-modal dialog systems, with applications such as chatbots. Its contribution is incremental: it builds on the existing CLIP model through prompt-tuning.

The paper tackles the problem of multi-modal dialog retrieval by proposing DialCLIP, a parameter-efficient prompt-tuning method that enhances CLIP to better capture dialog context, achieving state-of-the-art performance on benchmark datasets like PhotoChat and MMDialog while tuning only 0.04% of parameters.

Recently, substantial advancements in pre-trained vision-language models have greatly enhanced the capabilities of multi-modal dialog systems. These models have demonstrated significant improvements when fine-tuned on downstream tasks. However, existing pre-trained models primarily focus on capturing the alignment between vision and language modalities, often ignoring the intricate nature of dialog context. In this paper, we propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval. Specifically, our approach introduces a multi-modal context prompt generator to learn context features, which are subsequently distilled into prompts within the pre-trained vision-language model CLIP. In addition, we introduce a domain prompt to mitigate the discrepancy from the downstream dialog data. To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to a multi-modal representation space, with each expert being responsible for one specific retrieval type. Extensive experiments show that DialCLIP achieves state-of-the-art performance on two widely recognized benchmark datasets (i.e., PhotoChat and MMDialog) by tuning a mere 0.04% of the total parameters. These results highlight the efficacy and efficiency of our proposed approach, underscoring its potential to advance the field of multi-modal dialog retrieval.
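The "multiple experts" idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the retrieval-type names, the dimensions, the plain linear maps, and the helper names (`ExpertRouter`, `trainable_fraction`) are all assumptions made for illustration. It only shows the general pattern of keeping one small trainable map per retrieval type on top of frozen backbone features, and how a tuned-parameter fraction such as 0.04% would be computed.

```python
import random

# Hypothetical retrieval types; the abstract does not enumerate them.
RETRIEVAL_TYPES = ("text_to_image", "text_to_text")


def make_linear(in_dim, out_dim, rng):
    """Build a random out_dim x in_dim weight matrix as a list of rows."""
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weight, vec):
    """Multiply a weight matrix (list of rows) by a feature vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


class ExpertRouter:
    """One small trainable map per retrieval type (illustrative only)."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.experts = {t: make_linear(in_dim, out_dim, rng) for t in RETRIEVAL_TYPES}

    def project(self, features, retrieval_type):
        # Route the frozen-backbone features to the expert for this retrieval type.
        return apply_linear(self.experts[retrieval_type], features)

    def num_parameters(self):
        return sum(len(w) * len(w[0]) for w in self.experts.values())


def trainable_fraction(trainable, total):
    """Share of parameters that are tuned; 0.0004 corresponds to 0.04%."""
    return trainable / total


router = ExpertRouter(in_dim=8, out_dim=4)
features = [1.0] * 8  # stand-in for a frozen backbone output vector
projected = router.project(features, "text_to_image")
```

In this toy setup, `router.num_parameters()` counts only the expert weights, so dividing it by a (much larger) frozen-backbone parameter count gives the kind of small tuned fraction the abstract reports. The actual split of trainable parameters across prompts and experts in DialCLIP is not stated in this summary.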

