SD · May 14

Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR

Ryo Magoshi, Takashi Maekaku, Yusuke Shinohara

arXiv:2605.14340 · 61.2

Predicted impact: top 40% in SD · last 90 days · Originality: Incremental advance

AI Analysis

For ASR researchers needing to adapt LLM-based models to new domains without paired audio-text data, this method offers a more effective solution than prior text-only approaches.

The paper addresses text-only domain adaptation for LLM-based ASR by generating pseudo-audio prompts that incorporate speech-text alignment, achieving improved error rates and out-of-vocabulary coverage compared to existing methods.

LLM-based automatic speech recognition (ASR) models demonstrate strong performance by connecting audio encoders and LLMs. However, the scarcity of paired speech and transcription data often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing methods typically rely on either fine-tuning the LLM alone or employing pseudo-audio prompts. The former neglects essential acoustic context, while the latter either suffers from limited scalability in data-scarce conditions or yields inexpressive prompts by leveraging only textual features and ignoring the audio modality. To address this, we propose an enhanced framework that explicitly models speech-text alignment. Our method efficiently generates highly expressive pseudo-audio prompts that bridge the modality gap, enabling effective target-domain adaptation. Experiments demonstrate that our approach outperforms existing text-only methods, improving both overall error rates and out-of-vocabulary coverage.
