CL, SD, AS
Nov 2, 2022

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

MIT
arXiv:2211.01180v2
11 citations
h-index: 52
Originality: Incremental advance
AI Analysis

The paper addresses cross-modal (speech-to-image) retrieval for non-English speakers. The advance is incremental because it builds on existing pre-trained models.

The paper tackles multilingual speech-to-image retrieval by adapting English-only pre-trained models (CLIP and HuBERT) to non-English languages. It reports state-of-the-art performance by a wide margin, both with separate per-language models and with a single model covering all three languages.

This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages. We identify key differences in model behavior and performance between English and non-English settings, attributable to the English-only pre-training of CLIP and HuBERT, and investigate how fine-tuning the pre-trained models impacts these differences. Finally, we show that our models can be used for mono- and cross-lingual speech-text retrieval and cross-lingual speech-speech retrieval, despite never having seen any parallel speech-text or speech-speech data during training.

