CL LG SD AS
Jun 18, 2024

Improving Text-To-Audio Models with Synthetic Captions

arXiv:2406.15487v2
21 citations
Originality: Highly original
AI Analysis

This work addresses the shortage of high-quality training data for text-to-audio models, a problem that matters to researchers and developers in audio generation. It provides a scalable method for improving caption quality, though the contribution is incremental in that it builds on prior caption augmentation methods.

The paper tackles the difficulty of obtaining high-quality captions for training text-to-audio models. It proposes an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions at scale. Pre-training on these synthetic captions yields significant improvements in audio generation quality and state-of-the-art results on the AudioCaps and MusicCaps benchmarks.

It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged text-only language models to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find leveraging our pipeline and synthetic captions leads to significant improvements on audio generation quality, achieving a new state-of-the-art.
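
The abstract describes the pipeline only at a high level, so the following is a minimal sketch of what "synthesize accurate and diverse captions at scale" could look like, not the authors' implementation. It assumes two hypothetical callables that do not come from the source: `captioner`, standing in for an audio language model that proposes several captions per clip, and `scorer`, standing in for some audio-text agreement measure. The threshold, the top-k cutoff, and the deduplication step are likewise illustrative assumptions.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical interfaces (assumptions, not taken from the paper):
#   captioner(audio_id, n) -> n candidate captions from an audio language model
#   scorer(audio_id, caption) -> audio-text agreement score, higher is better
Captioner = Callable[[str, int], List[str]]
Scorer = Callable[[str, str], float]


def synthesize_captions(
    audio_ids: Iterable[str],
    captioner: Captioner,
    scorer: Scorer,
    n_candidates: int = 4,
    min_score: float = 0.5,
    keep_top: int = 2,
) -> Dict[str, List[str]]:
    """Generate candidates per clip, then keep the best-scoring distinct ones."""
    dataset: Dict[str, List[str]] = {}
    for audio_id in audio_ids:
        candidates = captioner(audio_id, n_candidates)

        # Diversity: drop empty and case-insensitive duplicate captions.
        seen = set()
        unique = []
        for cap in candidates:
            key = cap.strip().lower()
            if key and key not in seen:
                seen.add(key)
                unique.append(cap.strip())

        # Accuracy: score each caption against its audio and filter.
        scored = [(scorer(audio_id, cap), cap) for cap in unique]
        kept = sorted(
            (pair for pair in scored if pair[0] >= min_score),
            key=lambda pair: pair[0],
            reverse=True,
        )
        if kept:
            dataset[audio_id] = [cap for _, cap in kept[:keep_top]]
    return dataset


# Toy stand-ins so the sketch runs without any model.
def toy_captioner(audio_id: str, n: int) -> List[str]:
    pool = ["a dog barks", "A dog barks", "rain falls on a roof", "music plays"]
    return pool[:n]


def toy_scorer(audio_id: str, caption: str) -> float:
    if "dog" in caption:
        return 0.9
    if "rain" in caption:
        return 0.6
    return 0.1


result = synthesize_captions(["clip_001"], toy_captioner, toy_scorer)
```

With the toy stand-ins, the duplicate "A dog barks" is dropped and "music plays" falls below the threshold, so `clip_001` keeps two captions. In a real setting the stand-ins would be replaced by an actual audio language model and a learned audio-text similarity model.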

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
