Text-only adaptation in LLM-based ASR through text denoising
This work addresses domain adaptation in ASR for users who need efficient updates without audio data. The contribution is incremental, since it builds on existing LLM-based ASR frameworks.
The paper tackles the problem of adapting LLM-based ASR systems to new domains using only text data. Standard fine-tuning of the LLM on target-domain text often degrades performance by disrupting speech-text alignment. The paper introduces a text denoising method that emulates audio projection, achieving up to 22.1% relative improvement and outperforming state-of-the-art text-only adaptation methods.
Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our approach thus trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
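The denoising setup described above can be illustrated with a minimal sketch of how training pairs might be built from target-domain text. The abstract does not say how the noisy inputs are produced, so the character-level deletion, substitution, and insertion noise used here is purely an assumption, and the names `corrupt`, `make_denoising_pairs`, and `noise_rate` are hypothetical. The method itself is described as emulating the audio projection, which this simple corruption does not attempt to reproduce.

```python
import random
import string
from typing import Dict, List, Optional


def corrupt(text: str, noise_rate: float = 0.2, seed: Optional[int] = None) -> str:
    """Apply random character-level noise to a transcript.

    ASSUMPTION: this noise model (delete / substitute / insert) is a
    stand-in chosen for illustration, not the paper's noise process.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < noise_rate / 3:
            # Deletion: drop the character.
            continue
        elif r < 2 * noise_rate / 3:
            # Substitution: replace it with a random lowercase letter.
            out.append(rng.choice(string.ascii_lowercase))
        elif r < noise_rate:
            # Insertion: keep it and append a random lowercase letter.
            out.append(ch)
            out.append(rng.choice(string.ascii_lowercase))
        else:
            # No noise: keep the character unchanged.
            out.append(ch)
    return "".join(out)


def make_denoising_pairs(
    transcripts: List[str], noise_rate: float = 0.2, seed: int = 0
) -> List[Dict[str, str]]:
    """Build (noisy input, clean target) pairs from text-only domain data."""
    rng = random.Random(seed)
    pairs = []
    for clean in transcripts:
        noisy = corrupt(clean, noise_rate, seed=rng.randrange(2**32))
        pairs.append({"input": noisy, "target": clean})
    return pairs


if __name__ == "__main__":
    domain_text = ["the patient was prescribed ten milligrams of atorvastatin"]
    for pair in make_denoising_pairs(domain_text):
        print(pair["input"], "->", pair["target"])
```

In such a setup, the LLM would be fine-tuned to map each `input` to its `target`, so the only data required from the target domain is clean text.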