SD · CL · LG · AS · Apr 27, 2024

T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining

arXiv:2404.17806v1 · 31 citations · h-index: 65 · MLSP
Originality: Incremental advance
AI Analysis

T-CLAP addresses a bottleneck in audio-language alignment that affects tasks such as audio retrieval and generation. It is an incremental improvement over existing CLAP methods.

The paper tackles the problem of current contrastive language-audio pretraining (CLAP) models struggling to capture temporal information in audio and text, which limits performance in tasks like retrieval and generation. The result is T-CLAP, a temporal-enhanced model that significantly outperforms state-of-the-art models in multiple downstream tasks.

Contrastive language-audio pretraining (CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks. However, current CLAP struggles to capture temporal information within audio and text features, presenting substantial limitations for tasks such as audio retrieval and generation. To address this gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use Large Language Models (LLMs) and mixed-up strategies to generate temporal-contrastive captions for audio clips from extensive audio-text datasets. Subsequently, a new temporal-focused contrastive loss is designed to fine-tune the CLAP model by incorporating these synthetic data. We conduct comprehensive experiments and analysis in multiple downstream tasks. T-CLAP shows improved capability in capturing the temporal relationship of sound events and outperforms state-of-the-art models by a significant margin.
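
The abstract does not spell out the caption-generation procedure or the exact form of the temporal-focused loss, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a temporal-negative caption can be built by swapping two events around a connective, and that the loss is an InfoNCE-style objective with those negatives added to the denominator. The function names (`swap_temporal_order`, `temporal_contrastive_loss`), the connective " followed by ", and the temperature value 0.07 are all hypothetical choices made for illustration.

```python
import math


def swap_temporal_order(caption, connective=" followed by "):
    """Return a caption with the two events around `connective` reversed.

    Hypothetical rule-based stand-in for the LLM-generated
    temporal-contrastive captions mentioned in the abstract.
    Returns None when the connective is absent.
    """
    first, sep, second = caption.partition(connective)
    if not sep or not first or not second:
        return None
    # Move the capital letter to the event that now comes first.
    new_first = second[0].upper() + second[1:]
    new_second = first[0].lower() + first[1:]
    return new_first + connective + new_second


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def temporal_contrastive_loss(audio, pos_text, neg_texts, temperature=0.07):
    """InfoNCE-style loss for one audio clip (assumed form, not the paper's).

    `pos_text` is the embedding of the correct caption; `neg_texts` holds
    embeddings of negatives, including temporally swapped captions.
    """
    logits = [cosine(audio, pos_text) / temperature]
    logits += [cosine(audio, neg) / temperature for neg in neg_texts]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]
```

In this sketch the loss is small when the audio embedding is closer to the correct caption than to its temporally swapped counterpart, and large when the order is confused, which is the behaviour a temporal-focused objective is meant to encourage.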

