Audio-Language Models for Audio-Centric Tasks: A Systematic Survey

Yi Su, Jisheng Bai, Qisheng Xu, Kele Xu, Yong Dou

arXiv:2501.15177

Predicted impact: top 4% in SD (last 90 days)
Originality: synthesis-oriented

AI Analysis

This is an incremental survey that helps researchers and practitioners in audio-centric AI by summarizing existing technologies and providing references for practical applications.

The paper addresses the lack of systematic surveys on Audio-Language Models (ALMs) by presenting the first comprehensive review that organizes developments across speech, music, and sound. It establishes a unified taxonomy and a research landscape to help readers understand the field and its future trends.

Audio-Language Models (ALMs), trained on paired audio-text data, are designed to process, understand, and reason about audio-centric multimodal content. Unlike traditional supervised approaches that use predefined labels, ALMs leverage natural language supervision to better handle complex real-world audio scenes with multiple overlapping events. Although ALMs demonstrate impressive zero-shot and task generalization capabilities, there is still a notable lack of systematic surveys that comprehensively organize and analyze these developments. In this paper, we present the first systematic review of ALMs with three main contributions: (1) comprehensive coverage of ALM works across speech, music, and sound from a general audio perspective; (2) a unified taxonomy of ALM foundations, including model architectures and training objectives; (3) a research landscape that captures the mutual promotion and constraints among different research aspects, which aids in summarizing evaluations, limitations, concerns, and promising directions. Our review helps researchers understand the development of existing technologies and future trends, and provides valuable references for implementation in practical applications.
