SD CL LG AS
Sep 18, 2025

Exploring Fine-Tuning of Large Audio Language Models for Spoken Language Understanding under Limited Speech data

arXiv:2509.15389v1
2 citations
h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses the adaptation of large audio language models to practical spoken language understanding tasks under realistic data constraints. The contribution is incremental but provides practical insights.

The study examines fine-tuning Large Audio Language Models (LALMs) for spoken language understanding when speech data are limited. It finds that text-only fine-tuning achieves competitive performance, that adding small amounts of speech data (2-5%) yields substantial further gains, and that curriculum learning is particularly effective when speech data are scarce.

Large Audio Language Models (LALMs) have emerged as powerful tools for speech-related tasks but remain underexplored for fine-tuning, especially with limited speech data. To bridge this gap, we systematically examine how different fine-tuning schemes, including text-only, direct mixing, and curriculum learning, affect spoken language understanding (SLU), focusing on scenarios where text-label pairs are abundant while paired speech-label data are limited. Results show that LALMs already achieve competitive performance with text-only fine-tuning, highlighting their strong generalization ability. Adding even small amounts of speech data (2-5%) yields substantial further gains, with curriculum learning particularly effective under scarce data. In cross-lingual SLU, combining source-language speech data with target-language text and minimal target-language speech data enables effective adaptation. Overall, this study provides practical insights into LALM fine-tuning under realistic data constraints.
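The three fine-tuning schemes named in the abstract differ mainly in how text-label and speech-label examples are arranged for training. The following is a minimal sketch of that arrangement, not the authors' implementation. The function `build_schedule`, its parameters, and the data layout are hypothetical. Two further assumptions are made because the abstract does not specify them: the speech fraction is measured relative to the number of text-label pairs, and the curriculum runs a text stage before a speech stage.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]          # (input, label); input is a transcript or an audio path
Example = Tuple[str, str, str]  # (modality, input, label)


def build_schedule(
    text_data: List[Pair],
    speech_data: List[Pair],
    scheme: str,
    speech_fraction: float = 0.05,
    seed: int = 0,
) -> List[List[Example]]:
    """Return a list of training stages; each stage is a shuffled list of examples.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    text = [("text", x, y) for x, y in text_data]

    # Assumption: the speech budget is a fraction of the text set size.
    k = min(len(speech_data), int(round(speech_fraction * len(text_data))))
    speech = [("speech", x, y) for x, y in rng.sample(speech_data, k)]

    if scheme == "text_only":
        stages = [text]                 # no speech examples at all
    elif scheme == "direct_mixing":
        stages = [text + speech]        # one stage, both modalities interleaved
    elif scheme == "curriculum":
        stages = [text, speech]         # assumption: text stage first, then speech
    else:
        raise ValueError(f"unknown scheme: {scheme}")

    for stage in stages:
        rng.shuffle(stage)
    return stages


# Toy data: 100 text-label pairs and 20 speech-label pairs.
text_data = [(f"utterance {i}", "intent_a") for i in range(100)]
speech_data = [(f"clip_{i}.wav", "intent_a") for i in range(20)]

curriculum = build_schedule(text_data, speech_data, "curriculum", speech_fraction=0.05)
mixed = build_schedule(text_data, speech_data, "direct_mixing", speech_fraction=0.05)
text_only = build_schedule(text_data, speech_data, "text_only")
```

Each returned stage would be passed in order to whatever training loop is in use. With the toy data above and a 5% fraction, the curriculum yields a 100-example text stage followed by a 5-example speech stage, while direct mixing yields a single 105-example stage.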

