AS · CL · SD · Mar 19

How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

arXiv:2603.19195 · 95.31 citations · h-index: 12
Predicted impact: top 3% in AS · last 90 days · Originality: Incremental advance
AI Analysis

This work gives audio AI researchers and practitioners empirical grounding for understanding how the LLM backbone shapes an audio language model.

The study investigates how much auditory knowledge large language models (LLMs) encode through text-only pre-training and how that knowledge affects downstream audio tasks. It finds that auditory knowledge varies substantially across LLM families and that text-only results correlate strongly with audio performance.

Large language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training and how this affects downstream performance remains unclear. We study this gap by comparing different LLMs under two text-only settings and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions from an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into an LALM with an audio encoder. Our findings reveal that auditory knowledge varies substantially across families, and text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.
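The abstract reports that text-only results correlate strongly with audio performance but does not say here which statistic was used. As a rough sketch of how such a comparison could be checked, the snippet below computes Pearson and Spearman correlations between per-model text-only scores and audio-grounded scores. The model names (`llm_a` to `llm_d`) and all score values are hypothetical placeholders, not results from the paper, and the choice of correlation measures is an assumption.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical placeholder accuracies, NOT results from the paper.
text_only = {"llm_a": 0.62, "llm_b": 0.71, "llm_c": 0.55, "llm_d": 0.80}
audio_grounded = {"llm_a": 0.58, "llm_b": 0.69, "llm_c": 0.57, "llm_d": 0.74}

models = sorted(text_only)
x = [text_only[m] for m in models]
y = [audio_grounded[m] for m in models]

print(f"Pearson r    = {pearson(x, y):.3f}")
print(f"Spearman rho = {spearman(x, y):.3f}")
```

Spearman is the more natural choice when the question is whether a text-only benchmark preserves the ranking of backbones, while Pearson also reflects how linearly the two scores track each other.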

