SD | AI | CL | AS | Mar 23

Multi-Task Instruction Tuning via Data Scheduling for Low-Resource Arabic AudioLLMs

arXiv:2601.12494 | 72.71 citations | h-index: 2
AI Analysis

This work addresses the problem of adapting audio LLMs for linguistically complex environments, offering practical guidance for low-resource applications, though it is incremental in nature.

The paper tackles the challenge of adapting audio large language models to low-resource, dialect-rich Arabic-English settings by studying multi-task instruction tuning strategies. It finds that a two-stage strategy, Task-Progressive Curriculum followed by Aligner-Based Diverse Sampling (TPC->ADS), provides the most reliable balance across tasks such as ASR and emotion recognition.

Audio large language models (LLMs) enable unified speech understanding and generation, but adapting them to linguistically complex and dialect-rich settings such as Arabic-English remains challenging. We present a controlled study of multi-task instruction tuning for an Arabic-centric audio LLM across generative tasks, including ASR and speech and text summarization, and discriminative tasks, including dialect and emotion recognition, in a resource-constrained setting. To support end-to-end Arabic speech summarization, we introduce AraMega-SSum, a first speech summarization resource for training and benchmarking Arabic-centric Audio-LLMs. We compare four training strategies: (i) Uniform Task Mixing, (ii) Task-Progressive Curriculum (TPC), (iii) Aligner-Based Diverse Sampling (ADS) for training-time batch construction, and (iv) a two-stage TPC->ADS strategy. Our results show a clear efficiency-robustness trade-off. ADS speeds up early convergence and improves paralinguistic performance; however, it hurts other tasks. A two-stage TPC->ADS strategy gives the most reliable overall balance across tasks, offering practical guidance for adapting omni audio LLMs to low-resource, dialect-rich environments. We will make AraMega-SSum and all experimental resources publicly available to the community.
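
To make the difference between the scheduling strategies concrete, the following is a minimal sketch of how per-step task schedules might be built. It is not the paper's implementation: the task list order, the equal-length curriculum phases, the `switch_frac` parameter, and all function names are assumptions made for illustration. The abstract does not describe how ADS selects samples, so plain uniform mixing stands in for the second stage here.

```python
import random

# Task names taken from the abstract; their order in the curriculum is an assumption.
TASKS = ["asr", "summarization", "dialect_id", "emotion"]


def uniform_mixing(num_steps, tasks=TASKS, seed=0):
    """Strategy (i): each training step draws one task uniformly at random."""
    rng = random.Random(seed)
    return [rng.choice(tasks) for _ in range(num_steps)]


def task_progressive(num_steps, tasks=TASKS, seed=0):
    """Strategy (ii), as assumed here: the pool of active tasks grows by one
    task per equal-length phase, so early steps see only the first task."""
    rng = random.Random(seed)
    phase_len = max(1, num_steps // len(tasks))
    schedule = []
    for step in range(num_steps):
        n_active = min(len(tasks), step // phase_len + 1)
        schedule.append(rng.choice(tasks[:n_active]))
    return schedule


def two_stage(num_steps, switch_frac=0.5, tasks=TASKS, seed=0):
    """Shape of strategy (iv): a curriculum stage followed by a second sampler.
    Uniform mixing is a placeholder for ADS, whose details are not given."""
    n_first = int(num_steps * switch_frac)
    first = task_progressive(n_first, tasks, seed)
    second = uniform_mixing(num_steps - n_first, tasks, seed + 1)
    return first + second


if __name__ == "__main__":
    print(two_stage(12))
```

Each function returns a list of task names, one per training step, which a data loader could use to decide which task's batch to draw next.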
