SDAIAS
Sep 24, 2024

Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM

arXiv:2409.15905v2
11 citations
h-index: 7
Originality: Incremental advance
AI Analysis

This paper addresses the problem of accurately transcribing speech that mixes multiple languages, which is crucial for multilingual applications, though its approach appears incremental.

The paper tackles the challenge of Code-Switching in Automatic Speech Recognition by introducing a speech-conditioned LLM with a Mixture of Experts connector and an IDIT mechanism, achieving significant performance improvements over state-of-the-art models.

In this paper, we introduce a speech-conditioned Large Language Model (LLM) integrated with a Mixture of Experts (MoE) based connector to address the challenge of Code-Switching (CS) in Automatic Speech Recognition (ASR). Specifically, we propose an Insertion and Deletion of Interruption Token (IDIT) mechanism to better transfer the text generation ability of the LLM to the speech recognition task. We also present a connector with an MoE architecture that manages multiple languages efficiently. To further enhance the collaboration of multiple experts and leverage the understanding capabilities of the LLM, we propose a two-stage progressive training strategy: 1) The connector is unfrozen and trained with language-specialized experts to map speech representations to the text space. 2) The connector and the LLM LoRA adapter are trained with the proposed IDIT mechanism, and all experts are activated to learn general representations. Experimental results demonstrate that our method significantly outperforms state-of-the-art models, including end-to-end and large-scale audio-language models.
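The connector and its two training stages can be illustrated with a minimal sketch. The abstract does not specify the connector's internals, so everything here is an assumption made for illustration: the class name `MoEConnector`, the single linear layer per expert, the hard routing by a `language_id` in stage 1, and the softmax router used when all experts are active in stage 2. It is not the authors' implementation, and it does not model the IDIT mechanism or LoRA training.

```python
import math
import random


def linear(x, w):
    """Multiply vector x (length d_in) by weight matrix w (d_in x d_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class MoEConnector:
    """Hypothetical MoE connector: maps a speech frame to the LLM text space."""

    def __init__(self, d_speech, d_text, n_experts, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # One linear projection per expert (assumption: real experts may be deeper).
        self.experts = [init(d_speech, d_text) for _ in range(n_experts)]
        # Router that scores each expert from the speech frame.
        self.router = init(d_speech, n_experts)

    def gate(self, frame, language_id=None):
        if language_id is not None:
            # Stage 1 (assumed): hard-route each frame to its language's expert.
            return [1.0 if i == language_id else 0.0 for i in range(len(self.experts))]
        # Stage 2 (assumed): all experts active, mixed by learned router weights.
        return softmax(linear(frame, self.router))

    def forward(self, frame, language_id=None):
        weights = self.gate(frame, language_id)
        outputs = [linear(frame, w) for w in self.experts]
        # Weighted sum of expert outputs, one value per text-space dimension.
        return [
            sum(g * out[j] for g, out in zip(weights, outputs))
            for j in range(len(outputs[0]))
        ]


connector = MoEConnector(d_speech=4, d_text=3, n_experts=2)
frame = [0.5, -1.0, 0.25, 2.0]
stage1_out = connector.forward(frame, language_id=0)  # language-specialized expert only
stage2_out = connector.forward(frame)                 # all experts activated
```

In this sketch, passing `language_id` reproduces the output of a single expert, while omitting it blends every expert's output with router weights that sum to 1.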

