LG · Jun 25, 2024

A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR

Van Tung Pham, Yist Lin, Tao Han, Wei Li, Jun Zhang, Lu Lu, Yuxuan Wang

arXiv:2406.17272v1 · 4.63 citations

Originality: Incremental advance

AI Analysis

This work provides an incremental improvement for ASR systems by enhancing the integration of speech encoders and LLMs, particularly in domain mismatch conditions.

The paper addresses three limitations in connecting speech encoders to large language models for ASR: limited fine-tuning options, weak speech-text alignment, and high insertion errors. It does so with partial fine-tuning of the encoder and LLM using LoRA, a matching loss that improves alignment between modalities, and training and inference methods that reduce insertion errors, with results reported on the Librispeech corpus.
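To make the LoRA idea mentioned above concrete, here is a minimal pure-Python sketch of a low-rank adapter on one linear layer. It is an illustration of the general technique, not the paper's implementation: the class name `LoRALinear`, the rank, the `alpha` scaling, and the initialization are all assumptions chosen for the example, and the source does not say which layers are adapted.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical adapter: y = W x + (alpha / rank) * B (A x), with W frozen."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, never updated
        # A starts small and random, B starts at zero, so the adapter
        # initially leaves the frozen layer's output unchanged.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]

    def num_trainable(self):
        # Only A and B are trained: rank * (d_in + d_out) values.
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])


layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
output = layer.forward([1.0, 1.0])
```

Because only the two small matrices are trained, the number of trainable values grows with `rank * (d_in + d_out)` rather than `d_in * d_out`, which is why such methods are usually described as parameter-efficient.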

Recent works have shown promising results in connecting speech encoders to large language models (LLMs) for speech recognition. However, several limitations persist, including limited fine-tuning options, a lack of mechanisms to enforce speech-text alignment, and high insertion errors especially in domain mismatch conditions. This paper presents a comprehensive solution to address these issues. We begin by investigating more thoughtful fine-tuning schemes. Next, we propose a matching loss to enhance alignment between modalities. Finally, we explore training and inference methods to mitigate high insertion errors. Experimental results on the Librispeech corpus demonstrate that partially fine-tuning the encoder and LLM using parameter-efficient methods, such as LoRA, is the most cost-effective approach. Additionally, the matching loss improves modality alignment, enhancing performance. The proposed training and inference methods significantly reduce insertion errors.
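Since the abstract singles out insertion errors, the sketch below shows how they are conventionally counted when scoring ASR output: a word-level edit-distance alignment between reference and hypothesis, with substitutions, deletions, and insertions tallied separately. The function name `count_errors` and the whitespace tokenization are assumptions for illustration; the source does not describe the scoring tool it uses.

```python
def count_errors(reference, hypothesis):
    """Align two transcripts by word-level edit distance and count error types."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (edits, substitutions, deletions, insertions)
    # for the best alignment of ref[:i] with hyp[:j].
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)  # words match, no cost
            else:
                diagonal = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = table[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = table[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    edits, subs, dels, ins = table[-1][-1]
    return {
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
        "wer": edits / max(len(ref), 1),
    }


# A hypothesis that repeats its last word, a typical source of insertions.
result = count_errors("the cat sat", "the cat sat sat sat")
```

In this example the two extra words in the hypothesis are both counted as insertions, so a model that keeps generating text after the speech ends inflates the word error rate even when every spoken word was recognized correctly.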
