SD CV MM AS
Sep 20, 2023

Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning

arXiv:2309.11500v4
61 citations
h-index: 22
Originality: Incremental advance
AI Analysis

This work provides a high-quality dataset for audio-language AI research, addressing a data bottleneck in the field. It is an incremental advance, since it builds on existing multimodal data collection methods.

The authors tackled the problem of limited audio datasets for representation learning by creating Auto-ACD, a large-scale dataset with over 1.5 million audio-text pairs, which improved performance on tasks like audio-language retrieval and audio captioning.

Recently, the AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, for audio representation learning, existing datasets suffer from three limitations: insufficient volume, simplistic content, and arduous collection procedures. To establish an audio dataset with high-quality captions, we propose an innovative, automatic approach leveraging multimodal inputs, such as video frames and audio streams. Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs. We exploit a series of pre-trained models or APIs to determine audio-visual synchronisation and to generate image captions, object detections, or audio tags for specific videos. Subsequently, we employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues. To demonstrate the effectiveness of the proposed dataset, we train widely used models on it and show performance improvements on various downstream tasks, for example, audio-language retrieval, audio captioning, and zero-shot classification. In addition, we establish a novel benchmark with environmental information and provide a benchmark for audio-text tasks.
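
The caption-generation step described in the abstract (collect clues from pre-trained models, then hand them to an LLM) can be pictured with a minimal Python sketch. Everything named here is hypothetical: the `Clues` fields, the `build_prompt` helper, the `min_sync` threshold, and the prompt wording are illustrative assumptions, not the authors' implementation. In particular, using the synchronisation score as a filter is one plausible reading of "determine audio-visual synchronisation", not something the abstract states.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Clues:
    """Per-clip outputs of upstream pre-trained models (all fields hypothetical)."""
    image_caption: str = ""
    detected_objects: List[str] = field(default_factory=list)
    audio_tags: List[str] = field(default_factory=list)
    av_sync_score: float = 0.0  # assumed to lie in [0, 1]


def build_prompt(clues: Clues, min_sync: float = 0.5) -> Optional[str]:
    """Assemble an LLM prompt from the clues, or return None for a clip
    whose audio and video appear out of sync.

    The threshold and the prompt wording are assumptions for illustration,
    not values taken from the paper.
    """
    if clues.av_sync_score < min_sync:
        return None  # assumed filtering rule: skip poorly synchronised clips

    lines = ["Write one sentence describing the sound in this clip."]
    if clues.audio_tags:
        lines.append("Audio tags: " + ", ".join(clues.audio_tags))
    if clues.detected_objects:
        lines.append("Visible objects: " + ", ".join(clues.detected_objects))
    if clues.image_caption:
        lines.append("Frame caption: " + clues.image_caption)
    return "\n".join(lines)


if __name__ == "__main__":
    example = Clues(
        image_caption="a train arriving at a station",
        detected_objects=["train", "platform"],
        audio_tags=["rail transport", "horn"],
        av_sync_score=0.9,
    )
    print(build_prompt(example))
```

The returned string would then be sent to whichever LLM is used for paraphrasing; that call is left out because the abstract does not say which model or API is involved.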

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.