SD · Mar 27

TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling

arXiv:2603.05094 · 76.7 · h-index: 11
AI Analysis

This work addresses the scarcity of specialized corpora for localized audio-language modeling, particularly for Taiwanese dialectal prosody. It is an incremental advance in domain-specific applications.

The authors address the difficulty large audio-language models have with localized dialectal prosody by building TW-Sound580K, a Taiwanese audio-text instruction dataset. They demonstrate its utility with Tai-LALM, which reaches 49.1% accuracy on the TAU Benchmark, 6.5 percentage points above the 42.6% zero-shot baseline.

Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset's utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, marking a 6.5% absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning). This confirms that integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech.
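
The abstract does not say how Dual-ASR validation decides which clips to keep. As a rough illustration only, the sketch below assumes that two ASR systems each transcribe a clip and that the clip is retained when the two transcripts mostly agree. The `Clip` type, the character-level disagreement measure, and the 0.2 threshold are all hypothetical choices made for this example, not details taken from the paper.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def disagreement(hyp_a: str, hyp_b: str) -> float:
    """Character-level edit distance, normalized by the longer transcript.

    Whitespace is stripped first so spacing differences are not counted.
    """
    a = "".join(hyp_a.split())
    b = "".join(hyp_b.split())
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty transcripts cannot be verified
    return edit_distance(a, b) / longest


@dataclass
class Clip:
    """Hypothetical record: one audio clip with two ASR hypotheses."""
    clip_id: str
    asr_a: str
    asr_b: str


def verify(clips, max_disagreement=0.2):
    """Keep clips whose two transcripts disagree by at most the threshold.

    The 0.2 default is an arbitrary illustrative value.
    """
    return [c for c in clips
            if disagreement(c.asr_a, c.asr_b) <= max_disagreement]
```

Under these assumptions, a clip whose two transcripts differ by one character out of six would pass the filter, while a clip with two unrelated transcripts would be dropped. The same agreement score could in principle feed an arbitration step at inference time, although the paper's actual arbitration rule is not described here.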

