SD | AI | CL | MM | AS | Feb 23, 2025

Audio-FLAN: A Preliminary Release

arXiv:2502.16584v1 | 3 citations | h-index: 42 | Has Code
Originality: Incremental advance
AI Analysis

Audio-FLAN addresses the fragmented state of audio AI development by giving researchers and practitioners a foundational dataset. The advance is incremental, since it builds on existing instruction-tuning methods.

The paper tackles the lack of unified audio-language models by introducing Audio-FLAN, a large-scale instruction-tuning dataset with over 100 million instances covering 80 diverse audio tasks. The dataset is intended as a foundation for models that handle both understanding and generation tasks across speech, music, and sound domains in a zero-shot manner.

Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.

Code Implementations: 1 repo
