CL · SD · Apr 25

Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

arXiv:2604.23323 · 67.3 · h-index: 2
AI Analysis

For researchers in audio-text retrieval, this work offers an incremental improvement in handling noisy and long-form audio with small-batch training.

The paper tackles robust audio-text retrieval for long, noisy, and weakly labeled audio, proposing a cross-modal embedding refinement module and hybrid loss that improve retrieval accuracy by up to 5% over prior methods on benchmark datasets.

Audio-text retrieval enables semantic alignment between audio content and natural language queries, supporting applications in multimedia search, accessibility, and surveillance. However, current state-of-the-art approaches struggle with long, noisy, and weakly labeled audio due to their reliance on contrastive learning and large-batch training. We propose a novel multimodal retrieval framework that refines audio and text embeddings using a cross-modal embedding refinement module combining transformer-based projection, linear mapping, and bidirectional attention. To further improve robustness, we introduce a hybrid loss function blending cosine similarity, $\mathcal{L}_{1}$, and contrastive objectives, enabling stable training even under small-batch constraints. Our approach efficiently handles long-form and noisy audio (SNR 5 to 15 dB) via silence-aware chunking and attention-based pooling. Experiments on benchmark datasets demonstrate improvements over prior methods.
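To make the hybrid objective more concrete, here is a minimal sketch of a loss that blends a cosine term, an $\mathcal{L}_{1}$ term, and a contrastive term over a batch of paired audio and text embeddings. The abstract does not give the exact formulation, so everything specific here is an assumption: the function name `hybrid_loss`, the weights `w_cos`, `w_l1`, `w_con`, the temperature value, and the choice of a symmetric InfoNCE-style contrastive term are illustrative, not the authors' definitions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy(row, target):
    """Negative log-softmax of row[target], computed stably."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
    return log_sum_exp - row[target]


def hybrid_loss(audio, text, w_cos=1.0, w_l1=1.0, w_con=1.0, temperature=0.07):
    """Hypothetical hybrid loss; audio[i] and text[i] form a matched pair.

    All weights and the temperature are assumed values for illustration.
    """
    a = [l2_normalize(v) for v in audio]
    t = [l2_normalize(v) for v in text]
    n = len(a)

    # Cosine term: 1 - cos(a_i, t_i), averaged over matched pairs.
    cos_term = sum(1.0 - dot(a[i], t[i]) for i in range(n)) / n

    # L1 term: mean absolute difference between matched unit embeddings.
    l1_term = sum(
        sum(abs(x - y) for x, y in zip(a[i], t[i])) / len(a[i])
        for i in range(n)
    ) / n

    # Contrastive term: symmetric cross-entropy over the similarity matrix,
    # treating the other items in the batch as negatives.
    logits = [[dot(a[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    audio_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_audio = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    con_term = 0.5 * (audio_to_text + text_to_audio)

    return w_cos * cos_term + w_l1 * l1_term + w_con * con_term


# Usage: correctly paired embeddings should score lower than swapped ones.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
matched = hybrid_loss(audio_batch, [[1.0, 0.0], [0.0, 1.0]])
swapped = hybrid_loss(audio_batch, [[0.0, 1.0], [1.0, 0.0]])
```

One plausible reading of the small-batch claim is that the cosine and $\mathcal{L}_{1}$ terms depend only on matched pairs, so they still provide a training signal when the batch holds too few negatives for the contrastive term to be informative. This is an interpretation of the abstract, not a statement from it.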

