CL · IR · SD · AS · Dec 17, 2024

CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval

arXiv:2412.13071v2 · 1 citation · h-index: 6 · ECIR
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of retrieving information across languages and modalities (audio and text) for applications such as search and analysis. It represents an incremental improvement over existing methods.

This study tackled the problem of multilingual multimodal information retrieval by introducing CLASP, a contrastive language-speech pretraining model, which established new benchmarks in HITS@1, MRR, and meanR metrics, outperforming traditional ASR-based methods.

This study introduces CLASP (Contrastive Language-Speech Pretraining), a multilingual, multimodal representation tailored for audio-text information retrieval. CLASP leverages the synergy between spoken content and textual data. During training, we utilize our newly introduced speech-text dataset, which encompasses 15 diverse categories ranging from fiction to religion. CLASP's audio component integrates audio spectrograms with a pre-trained self-supervised speech model, while its language encoding counterpart employs a sentence encoder pre-trained on over 100 languages. This unified lightweight model bridges the gap between various modalities and languages, enhancing its effectiveness in handling and retrieving multilingual and multimodal data. Our evaluations across multiple languages demonstrate that CLASP establishes new benchmarks in HITS@1, MRR, and meanR metrics, outperforming traditional ASR-based retrieval methods that rely on transcribing speech into text for subsequent text retrieval, especially in specific scenarios.
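The three metrics named in the abstract can be illustrated with a small sketch. It assumes the standard definitions: each query has exactly one correct candidate, HITS@1 is the fraction of queries whose correct candidate ranks first, MRR is the mean reciprocal rank, and meanR is the mean rank. The function names, the toy similarity matrix, and the tie-handling rule are assumptions made for illustration; the paper's exact evaluation protocol is not described in this summary.

```python
from typing import Dict, Sequence


def rank_of_match(scores: Sequence[float], target: int) -> int:
    """Return the 1-based rank of the target candidate under descending score.

    Assumption: ties are resolved in the target's favour, since only
    strictly higher scores push the target down.
    """
    target_score = scores[target]
    return 1 + sum(1 for s in scores if s > target_score)


def retrieval_metrics(
    similarity: Sequence[Sequence[float]], targets: Sequence[int]
) -> Dict[str, float]:
    """Compute HITS@1, MRR, and mean rank from a query-by-candidate score matrix.

    similarity[i][j] is the score between query i (e.g. a text embedding)
    and candidate j (e.g. a speech embedding); targets[i] is the index of
    the correct candidate for query i.
    """
    ranks = [rank_of_match(row, t) for row, t in zip(similarity, targets)]
    n = len(ranks)
    return {
        "hits@1": sum(1 for r in ranks if r == 1) / n,
        "mrr": sum(1.0 / r for r in ranks) / n,
        "mean_rank": sum(ranks) / n,
    }


# Toy example (hypothetical scores): three queries, three candidates,
# with query i's correct candidate at index i.
similarity = [
    [0.9, 0.1, 0.3],  # correct candidate ranks 1st
    [0.2, 0.4, 0.8],  # correct candidate ranks 2nd
    [0.5, 0.6, 0.7],  # correct candidate ranks 1st
]
targets = [0, 1, 2]
metrics = retrieval_metrics(similarity, targets)
print(metrics)
```

Higher HITS@1 and MRR are better, while a lower mean rank is better, so a model that "establishes new benchmarks" on all three raises the first two and lowers the third.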

Code Implementations: 1 repo
