SD · MM · AS · Mar 12

Audio-Language Models for Audio-Centric Tasks: A Systematic Survey

arXiv:2501.15177 · 95.34 citations · h-index: 12
Predicted impact top 4% in SD · last 90 days · Originality: Synthesis-oriented
AI Analysis

This is an incremental survey that helps researchers and practitioners in audio-centric AI by summarizing existing technologies and providing references for practical applications.

The paper tackles the lack of systematic surveys on Audio-Language Models (ALMs) by presenting the first comprehensive review that organizes developments across speech, music, and sound. It establishes a unified taxonomy and a research landscape to help readers understand the field and its future trends.

Audio-Language Models (ALMs), trained on paired audio-text data, are designed to process, understand, and reason about audio-centric multimodal content. Unlike traditional supervised approaches that use predefined labels, ALMs leverage natural language supervision to better handle complex real-world audio scenes with multiple overlapping events. While ALMs demonstrate impressive zero-shot and task generalization capabilities, there is still a notable lack of systematic surveys that comprehensively organize and analyze developments. In this paper, we present the first systematic review of ALMs with three main contributions: (1) comprehensive coverage of ALM works across speech, music, and sound from a general audio perspective; (2) a unified taxonomy of ALM foundations, including model architectures and training objectives; (3) establishment of a research landscape capturing mutual promotion and constraints among different research aspects, aiding in summarizing evaluations, limitations, concerns and promising directions. Our review contributes to helping researchers understand the development of existing technologies and future trends, while also providing valuable references for implementation in practical applications.

