CL LG SD AS
Jun 18, 2024

Improving Text-To-Audio Models with Synthetic Captions

Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro

arXiv:2406.15487v29.621 citations
h-index: 60

Originality: Highly original

AI Analysis

This work addresses the limited supply of training data for text-to-audio models, a problem that matters to researchers and developers working on audio generation. It provides a scalable method for improving caption quality, though the contribution is incremental in that it builds on prior caption augmentation methods.

The paper tackles the challenge of obtaining high-quality captions for training text-to-audio models. It proposes an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions at scale. Pre-training on these synthetic captions significantly improves audio generation quality and achieves state-of-the-art performance on the AudioCaps and MusicCaps benchmarks.

It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged text-only language models to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find leveraging our pipeline and synthetic captions leads to significant improvements on audio generation quality, achieving a new state-of-the-art.
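The abstract describes the pipeline only at a high level, so the following is a minimal sketch of one way such a caption synthesis loop could look, not the authors' implementation. It assumes two hypothetical callables that do not come from the paper: `caption_fn`, standing in for an audio language model that returns several candidate captions per clip, and `score_fn`, standing in for some audio-caption agreement score. The threshold `min_score` and the duplicate check are likewise illustrative choices for "accurate" and "diverse".

```python
from typing import Callable, Iterable, List, Tuple


def synthesize_captions(
    clips: Iterable[str],
    caption_fn: Callable[[str, int], List[str]],
    score_fn: Callable[[str, str], float],
    num_candidates: int = 4,
    min_score: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Return (clip, caption, score) triples that pass the filter.

    caption_fn and score_fn are hypothetical stand-ins (assumptions):
    the first wraps an audio language model, the second scores how well
    a caption agrees with the audio.
    """
    kept: List[Tuple[str, str, float]] = []
    for clip in clips:
        seen = set()
        for caption in caption_fn(clip, num_candidates):
            # Normalize case and whitespace so near-identical captions collapse.
            normalized = " ".join(caption.lower().split())
            if not normalized or normalized in seen:
                continue  # drop empty or duplicate candidates
            seen.add(normalized)
            score = score_fn(clip, caption)
            if score >= min_score:  # keep only captions that match the audio
                kept.append((clip, caption, score))
    return kept


# Toy stand-ins so the sketch runs without any model (purely illustrative).
def fake_captioner(clip: str, n: int) -> List[str]:
    candidates = [f"A recording of {clip}", f"a recording of {clip}", "Unrelated noise"]
    return candidates[:n]


def fake_scorer(clip: str, caption: str) -> float:
    return 0.9 if clip in caption else 0.1


pairs = synthesize_captions(["rain"], fake_captioner, fake_scorer, num_candidates=3)
print(pairs)
```

In this toy run the second candidate is discarded as a duplicate of the first and the third falls below the score threshold, so a single caption survives. Swapping in a real captioner and scorer would leave the loop itself unchanged.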
