AS · SD · Mar 25

ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding

arXiv:2603.24038 · 95.81 citations · h-index: 23 · Has Code
Predicted impact: top 2% in AS · last 90 days · Originality: Synthesis-oriented
AI Analysis

This paper addresses the need for better training data for versatile large audio-language models aimed at general audio understanding. It is an incremental improvement, delivered through a new dataset.

The authors address the limited scale and descriptive granularity of audio captioning datasets by introducing ACAVCaps, a large-scale, fine-grained dataset derived from ACAV100M using a multi-expert pipeline and large language model synthesis. Models pre-trained on ACAVCaps generalized substantially better on downstream tasks than models pre-trained on other captioning datasets.

General audio understanding is a fundamental goal for large audio-language models, with audio captioning serving as a cornerstone task for their development. However, progress in this domain is hindered by existing datasets, which lack the scale and descriptive granularity required to train truly versatile models. To address this gap, we introduce ACAVCaps, a new large-scale, fine-grained, and multi-faceted audio captioning dataset. Derived from the ACAV100M collection, ACAVCaps is constructed using a multi-expert pipeline that analyzes audio from diverse perspectives (including speech, music, and acoustic properties), which are then synthesized into rich, detailed descriptions by a large language model. Experimental results demonstrate that models pre-trained on ACAVCaps exhibit substantially stronger generalization capabilities on various downstream tasks compared to those trained on other leading captioning datasets. The dataset is available at https://github.com/xiaomi-research/acavcaps.

