SD CL AS Jun 11, 2024

Bridging Language Gaps in Audio-Text Retrieval

Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang

arXiv:2406.07012v2 13.012 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of linguistic disparities in audio-text retrieval for multilingual applications, representing an incremental improvement by extending existing methods to non-English content.

The paper tackles the limitation of English-only audio-text retrieval models by proposing a language enhancement method using a multilingual text encoder and consistent ensemble distillation, achieving state-of-the-art performance on English datasets and promising results in seven other languages with only 10% additional training data.

Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions limits the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose language enhancement (LE), which uses a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder by applying consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval. Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho. At the same time, the approach yields promising retrieval results in seven other languages with only 10% additional language-enhanced training data. The source code is publicly available at https://github.com/zyyan4/ml-clap.
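
To make the retrieval task in the abstract concrete, here is a minimal sketch of embedding-based audio-text retrieval in both directions. It is not the authors' implementation: the function names and the toy three-dimensional embeddings are hypothetical, and ranking by cosine similarity is assumed as a common choice for this kind of retrieval. In the paper's setting the vectors would instead come from the audio encoder and the multilingual text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(query, database, top_k=1):
    """Return indices of the top_k database items most similar to the query."""
    scores = [cosine_similarity(query, item) for item in database]
    ranked = sorted(range(len(database)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Hypothetical, hand-made embeddings: item i in one list matches item i in the other.
audio_embeddings = [
    [0.9, 0.1, 0.0],  # clip 0
    [0.1, 0.8, 0.1],  # clip 1
    [0.0, 0.2, 0.9],  # clip 2
]
text_embeddings = [
    [1.0, 0.0, 0.1],  # caption 0
    [0.0, 1.0, 0.0],  # caption 1
    [0.1, 0.1, 1.0],  # caption 2
]

# Text-to-audio: for each caption, find the best-matching clip.
text_to_audio = [retrieve(t, audio_embeddings)[0] for t in text_embeddings]
# Audio-to-text: for each clip, find the best-matching caption.
audio_to_text = [retrieve(a, text_embeddings)[0] for a in audio_embeddings]

print("text -> audio:", text_to_audio)
print("audio -> text:", audio_to_text)
```

Because both modalities live in one shared embedding space, the same `retrieve` function serves both search directions; only the roles of query and database swap.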
