LG · Jun 25, 2024

A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR

arXiv:2406.17272v1 · 3 citations
Originality: Incremental advance
AI Analysis

This work provides an incremental improvement for ASR systems by enhancing the integration of speech encoders and LLMs, particularly in domain mismatch conditions.

The paper addresses three limitations of existing approaches to connecting speech encoders to large language models for ASR: limited fine-tuning options, no mechanism to enforce speech-text alignment, and high insertion errors. On the Librispeech corpus, it reports that partially fine-tuning the encoder and LLM with LoRA is the most cost-effective scheme, that a matching loss improves alignment between modalities, and that the proposed training and inference methods reduce insertion errors.
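As a rough illustration of why LoRA-style fine-tuning is cheap, the sketch below applies a low-rank update to one frozen linear layer in NumPy. It follows the standard LoRA formulation (frozen weight plus a scaled product of two small matrices); the layer sizes, rank, and scaling value are assumptions chosen for illustration, not the configuration used in the paper.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Linear layer with a LoRA update: y = x W^T + (alpha / r) * x A^T B^T.

    W (d_out, d_in) stays frozen; only A (r, d_in) and B (d_out, r) are trained.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

# Hypothetical sizes, chosen only for illustration.
d_in, d_out, r, alpha = 8, 6, 2, 4.0
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable, small random init
B = np.zeros((d_out, r))                  # trainable, zero init

x = rng.normal(size=(3, d_in))
y = lora_forward(x, W, A, B, alpha)

full_params = W.size                      # parameters updated by full fine-tuning
lora_params = A.size + B.size             # parameters updated by LoRA
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training touches `r * (d_in + d_out)` parameters instead of `d_in * d_out`.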

Recent works have shown promising results in connecting speech encoders to large language models (LLMs) for speech recognition. However, several limitations persist, including limited fine-tuning options, a lack of mechanisms to enforce speech-text alignment, and high insertion errors especially in domain mismatch conditions. This paper presents a comprehensive solution to address these issues. We begin by investigating more thoughtful fine-tuning schemes. Next, we propose a matching loss to enhance alignment between modalities. Finally, we explore training and inference methods to mitigate high insertion errors. Experimental results on the Librispeech corpus demonstrate that partially fine-tuning the encoder and LLM using parameter-efficient methods, such as LoRA, is the most cost-effective approach. Additionally, the matching loss improves modality alignment, enhancing performance. The proposed training and inference methods significantly reduce insertion errors.
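For readers unfamiliar with the term, insertion errors are one of the three error types counted in word error rate (WER), alongside substitutions and deletions. The sketch below counts each type from a minimum-edit-distance alignment of a reference and a hypothesis transcript. It is a generic illustration of the metric, not the paper's evaluation code, and the example transcripts are made up.

```python
def error_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for two word lists,
    taken from one minimum-edit-distance alignment."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)           # all reference words deleted
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)           # all hypothesis words inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, s, d, k = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, s, d, k)       # match, no cost
            else:
                diag = (c + 1, s + 1, d, k)   # substitution
            c, s, d, k = dp[i - 1][j]
            delete = (c + 1, s, d + 1, k)     # reference word missing
            c, s, d, k = dp[i][j - 1]
            insert = (c + 1, s, d, k + 1)     # extra hypothesis word
            dp[i][j] = min(diag, delete, insert, key=lambda t: t[0])
    return dp[n][m][1:]

def wer(ref, hyp):
    """Word error rate: (S + D + I) / number of reference words."""
    s, d, k = error_counts(ref, hyp)
    return (s + d + k) / len(ref)

# Made-up example: the hypothesis repeats the last word twice.
ref = "the cat sat".split()
hyp = "the cat sat sat sat".split()
counts = error_counts(ref, hyp)
rate = wer(ref, hyp)
```

Here the hypothesis contains two extra words, so the alignment yields two insertions and no substitutions or deletions. Since insertions are not bounded by the reference length, a hypothesis with many extra words can push WER above 100%.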
