SD, CL, AS | Mar 25, 2022

Audio-text Retrieval in Context

arXiv:2203.13645v2 | 35 citations | h-index: 68
AI Analysis

This work addresses the retrieval of audio from natural language descriptions, with applications in multimedia and AI. The contribution is incremental: it builds on existing methods and adds optimizations.

The paper tackles audio-text retrieval by exploring audio features and sequence aggregation methods that improve cross-modality alignment. It reports significant improvements in recall, median rank, and mean rank on the AudioCaps and CLOTHO datasets over the previous state-of-the-art system.
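The ranking metrics named above (recall, median rank, mean rank) can be illustrated with a minimal sketch. It assumes a square similarity matrix in which query i has exactly one correct candidate, candidate i. That is a simplification, since real benchmarks may pair several captions with one clip. The function name `retrieval_metrics` and the cutoffs in `ks` are hypothetical and not taken from the paper.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute recall@k, median rank, and mean rank.

    sim[i, j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    # Score that each query assigns to its own correct candidate.
    correct = sim[np.arange(n), np.arange(n)]
    # Rank = 1 + number of candidates scored strictly higher than the correct one.
    ranks = 1 + (sim > correct[:, None]).sum(axis=1)
    metrics = {f"R@{k}": float((ranks <= k).mean()) for k in ks}
    metrics["median_rank"] = float(np.median(ranks))
    metrics["mean_rank"] = float(ranks.mean())
    return metrics

# Toy example: 3 text queries scored against 3 audio clips.
toy_sim = [
    [0.9, 0.1, 0.2],  # correct clip ranked 1st
    [0.8, 0.3, 0.1],  # correct clip ranked 2nd
    [0.2, 0.7, 0.6],  # correct clip ranked 2nd
]
print(retrieval_metrics(toy_sim, ks=(1, 2)))
```

Higher recall is better, while lower median and mean rank are better. Transposing the similarity matrix gives the opposite retrieval direction (audio-to-text instead of text-to-audio).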

Audio-text retrieval based on natural language descriptions is a challenging task. It involves learning cross-modality alignments between long sequences under inadequate data conditions. In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment. Moreover, through a qualitative analysis we observe that semantic mapping is more important than temporal relations in contextual retrieval. Using pre-trained audio features and a descriptor-based aggregation method, we build our contextual audio-text retrieval system. Specifically, we utilize PANNs features pre-trained on a large sound event dataset and NetRVLAD pooling, which directly works with averaged descriptors. Experiments are conducted on the AudioCaps and CLOTHO datasets, and results are compared with the previous state-of-the-art system. With our proposed system, a significant improvement has been achieved on bidirectional audio-text retrieval, on all metrics including recall, median and mean rank.
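To make the descriptor-based aggregation step more concrete, here is a rough NumPy sketch of NetRVLAD-style pooling as it is commonly described: frame-level descriptors are softly assigned to K clusters and summed per cluster without subtracting cluster centres, then normalized. This is not the authors' implementation. The function name `netrvlad_pool`, the random weights, and the feature sizes are illustrative assumptions, and in a real system the assignment weights would be learned.

```python
import numpy as np

def netrvlad_pool(x, w, b):
    """Pool a variable-length sequence into a fixed-size vector.

    x: (T, D) frame-level descriptors (e.g. pre-trained audio features).
    w: (D, K) and b: (K,) soft-assignment parameters (learned in practice).
    Returns a unit-norm vector of length D * K.
    """
    logits = x @ w + b                             # (T, K)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)    # softmax over clusters
    # Residual-less aggregation: weighted sum of descriptors per cluster.
    v = x.T @ assign                               # (D, K)
    # Intra-normalization: L2-normalize each cluster column.
    v /= np.linalg.norm(v, axis=0, keepdims=True) + 1e-12
    v = v.reshape(-1)
    # Final L2 normalization of the flattened vector.
    return v / (np.linalg.norm(v) + 1e-12)

# Hypothetical sizes: D=8 feature dims, K=4 clusters.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4))
b = np.zeros(4)
short_clip = netrvlad_pool(rng.normal(size=(50, 8)), w, b)
long_clip = netrvlad_pool(rng.normal(size=(300, 8)), w, b)
print(short_clip.shape, long_clip.shape)
```

Because the pooled vector has the same length regardless of the number of frames, and frame order does not affect the sum, this kind of aggregation captures what is present in a clip rather than when it occurs.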
