Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation
This work addresses domain adaptation for speech recognition, improving performance in new domains without requiring additional speech data. It is incremental in that it builds on existing modality-alignment methods.
The paper tackles the problem of adapting end-to-end speech recognition to new domains using only text data. It proposes down-sampling the acoustic representations so that their length matches the text length, and its experiments show improved performance on new-domain data.
Mapping two modalities, speech and text, into a shared representation space is an active research direction for using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the lengths of the speech and text representations are inconsistent. Although the previous method up-samples the text representation to align with the acoustic modality, the up-sampled length may not match the actual duration. In this paper, we propose a novel representation matching strategy that down-samples the acoustic representation to align with the text modality. By introducing a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with the token length, our ASR model can better learn unified representations from both modalities, allowing for domain adaptation using text-only data from the target domain. Experimental results on new-domain data demonstrate the effectiveness of the proposed method.
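To make the down-sampling idea concrete, the following is a minimal NumPy sketch of the integrate-and-fire step as it is commonly described for CIF. It is an illustration under stated assumptions, not this paper's implementation: the function name `cif`, the firing threshold of 1.0, and the requirement that each per-frame weight lies in [0, 1] are all assumed here, and in a real model the weights would be predicted by a learned layer rather than passed in by hand.

```python
import numpy as np

def cif(frames, alphas, threshold=1.0):
    """Integrate frame vectors until the accumulated weight reaches
    `threshold`, then emit ("fire") one token-level vector.

    frames: array of shape (T, D), frame-level acoustic representations.
    alphas: array of shape (T,), per-frame weights, each assumed in [0, 1].
    Returns an array of shape (N, D), where N is the number of firings.
    """
    frames = np.asarray(frames, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    outputs = []
    acc_weight = 0.0                       # weight accumulated since last firing
    acc_state = np.zeros(frames.shape[1])  # weighted sum since last firing
    for h, a in zip(frames, alphas):
        if acc_weight + a < threshold:
            # Not enough weight yet: keep integrating.
            acc_weight += a
            acc_state += a * h
        else:
            # Split this frame's weight: part completes the current token,
            # the remainder starts the next one.
            used = threshold - acc_weight
            outputs.append(acc_state + used * h)
            remainder = a - used
            acc_weight = remainder
            acc_state = remainder * h
    # Any leftover weight below the threshold is discarded in this sketch.
    return np.array(outputs).reshape(-1, frames.shape[1])
```

Because one vector is emitted per unit of accumulated weight, the output length is roughly the sum of the weights. This is how a frame-level sequence can be shortened to a token-level one without up-sampling the text side.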